ON THE CONTRACTION PROPERTIES OF SOME
HIGH-DIMENSIONAL QUASI-POSTERIOR 
DISTRIBUTIONS 

By Yves F. Atchade* 

University of Michigan 

We study the contraction properties of a quasi-posterior distribution
$\check{\Pi}_{n,d}$ obtained by combining a quasi-likelihood function and
a sparsity inducing prior distribution on $\mathbb{R}^d$, as both $n$ (the
sample size) and $d$ (the dimension of the parameter) increase. We derive
some general results that highlight a set of sufficient conditions under
which $\check{\Pi}_{n,d}$ puts increasingly high probability on sparse
subsets of $\mathbb{R}^d$, and contracts towards the true value of the
parameter. We apply these results to the analysis of logistic regression
models, and binary graphical models, in high-dimensional settings. For the
logistic regression model, we show that for well-behaved design matrices,
the posterior distribution contracts at the rate
$O(\sqrt{s_\star \log(d)/n})$, where $s_\star$ is the number of non-zero
components of the parameter. For the binary graphical model, under some
regularity conditions, we show that a quasi-posterior analog of the
neighborhood selection of [29] contracts in the Frobenius norm at the rate
$O(\sqrt{(p + S)\log(p)/n})$, where $p$ is the number of nodes, and $S$ the
number of edges of the true graph.


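1. Introduction. Let $\mathcal{Z}^{(n)}$ denote a sample space equipped
with a reference sigma-finite measure denoted $dz$. The superscript $n$
represents the sample size. Let $Z$ be a $\mathcal{Z}^{(n)}$-valued random
variable that we model as having distribution $P^{(n)}_\theta$ given a
parameter $\theta \in \mathbb{R}^d$. We assume that $P^{(n)}_\theta$ has a
density $f_{n,\theta}$: $P^{(n)}_\theta(dz) = f_{n,\theta}(z)dz$. Let $\Pi$
be a prior distribution on $\mathbb{R}^d$. The resulting posterior
distribution for learning the parameter $\theta$ is the random probability
measure

$$A \mapsto \frac{\int_A f_{n,\theta}(Z)\Pi(d\theta)}{\int_{\mathbb{R}^d} f_{n,\theta}(Z)\Pi(d\theta)}, \qquad A \text{ meas. } \subseteq \mathbb{R}^d.$$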


In practice, many inference problems are best tackled using quasi-likelihood
(or pseudo-likelihood) functions. In the Bayesian framework, this leads to
quasi-Bayesian inference. Let $(\theta, z) \mapsto q_{n,\theta}(z)$ denote a
jointly measurable function such that
$0 < \int_{\mathbb{R}^d} q_{n,\theta}(Z)\Pi(d\theta) < \infty$, almost surely
[$dz$]. Substituting $q_{n,\theta}$ in place of $f_{n,\theta}$ yields the
quasi-posterior (QP) distribution

(1.1)  $$\check{\Pi}_{n,d}(A|Z) \stackrel{\mathrm{def}}{=} \frac{\int_A q_{n,\theta}(Z)\Pi(d\theta)}{\int_{\mathbb{R}^d} q_{n,\theta}(Z)\Pi(d\theta)}, \qquad A \subseteq \mathbb{R}^d.$$


Although $\check{\Pi}_{n,d}$ is not a posterior distribution in the usual
sense, it possesses the property that it is a probability distribution
obtained by tilting a prior distribution using a likelihood-like function.
Hence, to the extent that the quasi-likelihood function
$\theta \mapsto q_{n,\theta}(Z)$ contains information about the true value
of the parameter $\theta$, one can expect the same from the quasi-posterior
distribution (1.1), in which case valid inferential procedures can be
derived using $\check{\Pi}_{n,d}$. This idea is perhaps best seen by noting
that (1.1) is a solution of the minimization

$$\min_{\mu \ll \Pi} \left[ -\int_{\mathbb{R}^d} \log q_{n,\theta}(Z)\mu(d\theta) + \mathrm{KL}(\mu|\Pi) \right],$$

where $\mathrm{KL}(\mu|\Pi) \stackrel{\mathrm{def}}{=} \int \log(d\mu/d\Pi)d\mu$
is the KL-divergence between $\mu$ and $\Pi$, and where the minimization is
over all probability measures that are absolutely continuous with respect
to the prior $\Pi$. We refer to [36] for more details (and in particular to
Proposition 5.1 of that paper for a proof of the above statement). The
implication of this result is that, under appropriate regularity
conditions, one can expect the QP distribution to concentrate around the
maximizer of the function $\theta \mapsto \log q_{n,\theta}(Z)$, provided
that the prior distribution does not prevent it. The goal of this paper is
to formalize this idea for a class of statistical models.
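
As a hedged numerical illustration of the variational characterization (it is not part of the original argument), the following sketch replaces $\mathbb{R}^d$ by a finite grid of parameter values, so that probability measures become weight vectors. The grid size, the toy log quasi-likelihood values and all function names are assumptions made for illustration only. The sketch compares the value of the objective at the tilted prior with its value at randomly drawn competitors.

```python
import math
import random

def quasi_posterior(log_q, prior):
    """Tilt a discrete prior by exp(log_q): a finite-grid analog of (1.1)."""
    m = max(log_q)  # subtract the max for numerical stability
    w = [p * math.exp(l - m) for l, p in zip(log_q, prior)]
    total = sum(w)
    return [x / total for x in w]

def objective(mu, log_q, prior):
    """-sum_i mu_i * log_q_i + KL(mu | prior), with the convention 0 log 0 = 0."""
    val = 0.0
    for m_i, l_i, p_i in zip(mu, log_q, prior):
        if m_i > 0.0:
            val += -m_i * l_i + m_i * math.log(m_i / p_i)
    return val

def random_measure(k, rng):
    """A random probability vector on k points (hypothetical competitor)."""
    w = [rng.random() for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
k = 6
prior = random_measure(k, rng)
log_q = [rng.uniform(-5.0, 0.0) for _ in range(k)]  # toy values, not from the paper
qp = quasi_posterior(log_q, prior)
best = objective(qp, log_q, prior)
# Gap between each competitor and the tilted prior; each gap equals KL(mu | qp).
gaps = [objective(random_measure(k, rng), log_q, prior) - best for _ in range(1000)]
```

On a finite grid the gap is exactly the KL-divergence between the competitor and the tilted prior, which is why it cannot be negative; the value of the objective at the minimizer is minus the log of the normalizing constant.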

As pointed out to us by a referee, QP distributions are commonly used
in the PAC-Bayesian framework to aggregate estimators ([28, 15, 17, 1, 2]).
However, in this literature the emphasis is typically on the estimators,
not on the QP distributions themselves. An influential work on
quasi-Bayesian procedures is [16], which subsequently led to the
development of quasi-Bayesian inference in semi-parametric modeling,
particularly models arising from moment and conditional moment
restrictions ([26, 35, 22, 24]). Approximate Bayesian computation (ABC)
methods (see e.g. [27] and the references therein) are also popular
quasi-Bayesian procedures.

The present paper is motivated by the idea that quasi-Bayesian inference
holds great potential for dealing with high-dimensional statistical
models. For some of these models, likelihood-based inference is
intractable, and this has somewhat impeded the applicability of the
Bayesian framework in this area. However, M-estimation procedures that
maximize various quasi/pseudo-likelihood functions are often readily
available. Using the quasi-Bayesian framework, these quasi-likelihood
functions can be easily employed to derive tractable quasi-Bayesian
procedures.

We study the behavior of the QP distribution (1.1) when the prior
distribution $\Pi$ is given by

(1.2)  $$\Pi(d\theta) = \sum_{\delta \in \Delta_d} \pi_\delta \Pi(d\theta|\delta),$$

for a discrete distribution $\{\pi_\delta, \delta \in \Delta_d\}$ on
$\Delta_d \stackrel{\mathrm{def}}{=} \{0,1\}^d$, and a sparsity inducing
prior $\Pi(d\theta|\delta)$ on $\mathbb{R}^d$, that we build as follows.
Given $\delta$, the components of $\theta$ are independent, and for
$1 \le j \le d$,

(1.3)  $$\theta_j | \delta \sim \begin{cases} \mathrm{Dirac}(0) & \text{if } \delta_j = 0 \\ \mathrm{Laplace}(\rho) & \text{if } \delta_j = 1 \end{cases},$$

where Dirac(0) is the Dirac measure on $\mathbb{R}$ with full mass at 0,
and Laplace($\rho$) denotes the Laplace distribution with parameter
$\rho > 0$. The marginal prior distribution of $\theta_j$ implied by (1.3)
belongs to the class of spike-and-slab priors ([30]).
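
To make the construction in (1.3) concrete, here is a minimal sampling sketch. It assumes that the Laplace distribution with parameter $\rho$ has density $(\rho/2)e^{-\rho|x|}$, which is an assumption about the parametrization, although it is consistent with the exponential factor appearing in (2.1); the function names are hypothetical.

```python
import random

def sample_laplace(rho, rng):
    """Draw from the density (rho / 2) * exp(-rho * |x|) (assumed parametrization)."""
    magnitude = rng.expovariate(rho)  # |x| is exponential with rate rho
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_theta_given_delta(delta, rho, rng):
    """Draw theta as in (1.3): exactly 0 where delta_j = 0, Laplace where delta_j = 1."""
    return [sample_laplace(rho, rng) if d_j == 1 else 0.0 for d_j in delta]

rng = random.Random(1)
delta = [1, 0, 0, 1, 0]
theta = sample_theta_given_delta(delta, rho=2.0, rng=rng)
# The sparsity structure of the draw recovers delta (almost surely).
support = [1 if abs(t) > 0 else 0 for t in theta]
```

A draw from the full prior (1.2) would first sample $\delta$ from $\{\pi_\delta\}$ and then call `sample_theta_given_delta`.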

We work under the assumption that $Z \sim P^{(n)}_{\theta_\star}$ for some
$\theta_\star \in \mathbb{R}^d$. When $d$ is assumed fixed and
$n \to \infty$, it is known from the initial work of [16] that
$\check{\Pi}_{n,d}$ concentrates around $\theta_\star$, and is
asymptotically Gaussian (when properly scaled). Infinite-dimensional
extensions of such results have recently been studied ([26, 18, 22]). The
present paper focuses on the case where $\check{\Pi}_{n,d}$ arises from a
high-dimensional parametric model with the sparsity inducing prior
(1.2)-(1.3), and the results that we derive substantially extend previous
works by [13, 24]. More precisely, we derive a general result (Theorem 3)
that highlights the key determinants that control the convergence and
convergence rate of $\check{\Pi}_{n,d}$ towards $\theta_\star$. The theorem
is obtained by combining ideas from [13] together with a general
methodology for studying high-dimensional M-estimators synthesized in [31],
as well as an important technical result by [23] on the existence of test
functions.

We apply these results to the Bayesian analysis of high-dimensional
logistic regression models. We derive a non-asymptotic result (Theorem 4)
that shows that for large $d$, and appropriately large sample size $n$, the
resulting posterior distribution puts a high probability on sparse subsets
of $\mathbb{R}^d$, and contracts towards the true value of the parameter
$\theta_\star$ as $n, d \to \infty$, at the rate

$$O\left(\sqrt{\frac{s_\star \log(d)}{n}}\right),$$

where $s_\star = \|\theta_\star\|_0$. The constant in the big-O notation
depends crucially on some smallest restricted eigenvalues of the Fisher
information matrix of the model.

We also apply the results to a quasi-Bayesian inference of high-dimensional
binary graphical models. Discrete graphical models are known to pose
significant difficulties due to the intractable nature of the likelihood
function. A very successful frequentist approach to deal with large
graphical models is the neighborhood selection method of [29], initially
proposed for Gaussian graphical models, and extended to the Ising model by
[32]. We analyze a quasi-Bayesian version of neighborhood selection applied
to binary graphical models. We show that as $n, p \to \infty$ (where $p$ is
the number of nodes in the graph), provided that $n$ is sufficiently large,
the QP distribution obtained from neighborhood selection contracts towards
the true model parameter $\theta_\star$ in the Frobenius norm at the rate

$$O\left(\sqrt{\frac{(p + S)\log(p)}{n}}\right),$$

where $S$ is the number of edges in the graph defined by $\theta_\star$.
This convergence rate is the same as in the Gaussian case with a full
likelihood inference ([9]), and compares very well with the best existing
frequentist results. For instance [34] shows that the scaled g-Lasso
version of neighborhood selection in the Gaussian case converges at the
rate $O(s_\star\sqrt{\log(d)/n})$ in the spectral norm, where $s_\star$ is
the maximum degree of the graph defined by $\theta_\star$. In general, a
faster convergence rate can be achieved if one is only interested in
components of the matrix. To illustrate this we analyze the contraction of
$\check{\Pi}_{n,d}$ in the norm
$|||\theta||| \stackrel{\mathrm{def}}{=} \max_j \|\theta_{\cdot j}\|_2$,
where $\theta_{\cdot j}$ is the $j$-th column of $\theta$. We show that in
this norm, the QP distribution obtained from neighborhood selection
contracts towards $\theta_\star$ at the rate

$$O\left(\sqrt{\frac{s_\star \log(p)}{n}}\right),$$

where here $s_\star$ is the maximum degree of the graph defined by the true
parameter $\theta_\star$. Furthermore, the sample size $n$ required for
this result to hold is milder, and comparable to the sample size
requirement in simple high-dimensional logistic regressions.

An important issue not addressed in this work is how to obtain Monte
Carlo samples from the QP distribution (1.1). It is well known that
posterior and quasi-posterior distributions built from discrete-continuous
mixture priors as in (1.2)-(1.3) are computationally difficult to handle
with standard Markov Chain Monte Carlo algorithms. However, there has been
some recent progress, including the STMaLa of [33], or the Moreau
approximation approach of the author developed in [5]. We point the reader
to these works for more details and some additional references. Further
discussion of computational methods can be found in [13].

The remainder of the paper is organized as follows. First we close the
introduction with some notation that will be used throughout the paper.
Section 2 develops a general analysis of the QP distribution
$\check{\Pi}_{n,d}$. The applications to logistic regression models and
binary graphical models are discussed in Sections 3 and 4. The proof of
Theorem 3 is presented in Section 5, while the remaining proofs are
gathered in the supplementary material [3].

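1.1. Notation. For an integer $d \ge 1$, we equip the Euclidean space
$\mathbb{R}^d$ with its usual Euclidean inner product
$\langle \cdot, \cdot \rangle$, associated norm $\|\cdot\|_2$, and its
Borel sigma-algebra. We set
$\Delta_d \stackrel{\mathrm{def}}{=} \{0,1\}^d$. We will also use the
following norms on $\mathbb{R}^d$:
$\|\theta\|_1 \stackrel{\mathrm{def}}{=} \sum_{j=1}^d |\theta_j|$,
$\|\theta\|_0 \stackrel{\mathrm{def}}{=} \sum_{j=1}^d 1_{\{|\theta_j| > 0\}}$,
and $\|\theta\|_\infty \stackrel{\mathrm{def}}{=} \max_{1 \le j \le d} |\theta_j|$.

For $\delta \in \Delta_d$, $\mu_{d,\delta}$ denotes the product measure on
$\mathbb{R}^d$ defined as

$$\mu_{d,\delta}(d\theta) \stackrel{\mathrm{def}}{=} \prod_{j=1}^d \nu_{\delta_j}(d\theta_j),$$

where $\nu_0(dz)$ is the Dirac mass at 0, and $\nu_1(dz)$ is the Lebesgue
measure on $\mathbb{R}$. For $\theta, \theta' \in \mathbb{R}^d$,
$\theta \cdot \theta' \in \mathbb{R}^d$ denotes the component-wise product
of $\theta$ and $\theta'$: $(\theta \cdot \theta')_j = \theta_j \theta'_j$,
$1 \le j \le d$. And for $\delta \in \Delta_d$, we set
$\delta^c \stackrel{\mathrm{def}}{=} 1 - \delta$, that is
$\delta^c_j = 1 - \delta_j$, $1 \le j \le d$. For
$\theta \in \mathbb{R}^d$, the sparsity structure of $\theta$ is the element
$\delta \in \Delta_d$ defined as $\delta_j = 1_{\{|\theta_j| > 0\}}$,
$1 \le j \le d$.

Throughout the paper $e$ denotes the Euler number, and $\binom{m}{q}$ is
the combinatorial number $m!/(q!(m-q)!)$. For $x \in \mathbb{R}$, the
notation $\lceil x \rceil$ represents the smallest integer larger or equal
to $x$, and sign($x$) is the sign of $x$ (sign($x$) = 1 if $x > 0$,
sign($x$) = -1 if $x < 0$, and sign($x$) = 0 if $x = 0$). Finally, for
$\theta \in \mathbb{R}^d$ and $A \subseteq \mathbb{R}^d$,
$\theta + A \stackrel{\mathrm{def}}{=} \{\theta + u, u \in A\}$.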

2. Contraction properties of the quasi-posterior distribution
$\check{\Pi}_{n,d}$.

We consider the QP distribution (1.1) on $\mathbb{R}^d$ with the prior
distribution (1.2)-(1.3). Using the notation of Section 1.1,
$\check{\Pi}_{n,d}$ can be written as

(2.1)  $$\check{\Pi}_{n,d}(d\theta|Z) \propto q_{n,\theta}(Z) \sum_{\delta \in \Delta_d} \pi_\delta \left(\frac{\rho}{2}\right)^{\|\delta\|_0} e^{-\rho\|\theta\|_1} \mu_{d,\delta}(d\theta).$$

We are interested in the contraction behavior of $\check{\Pi}_{n,d}$ for
large $n, d$. We take the usual frequentist view of Bayesian procedures by
assuming the following.

H1. There exists $\theta_\star \in \mathbb{R}^d$ such that
$Z \sim P^{(n)}_{\theta_\star}(dz) = f_{n,\theta_\star}(z)dz$.

We write $\mathbb{E}^{(n)}$ for the expectation operator with respect to
$P^{(n)}_{\theta_\star}(dz)$. We also make the basic assumption that the
quasi-likelihood function is log-concave and smooth, and we use the
notation $\nabla \log q_{n,u}(z)$ to denote the derivative of the map
$\theta \mapsto \log q_{n,\theta}(z)$ at $u$. The $j$-th component of
$\nabla \log q_{n,u}(z)$ is written as $(\nabla \log q_{n,u}(z))_j$.

H2. For all $z \in \mathcal{Z}^{(n)}$, the map
$\theta \mapsto \log q_{n,\theta}(z)$ is concave and differentiable.


Remark 1. The assumption that the function
$\theta \mapsto \log q_{n,\theta}(z)$ is concave is imposed mostly for
simplicity, and is not crucial to derive the main result (Theorem 3). In
fact, this assumption is not used in Theorem 3-(2). However, in the
application of Theorem 3, concavity is typically crucial to control the
events $\mathcal{E}_n$ that appear in the theorem.

Following [13], we specify the prior $\{\pi_\delta, \delta \in \Delta_d\}$
as follows.

H3. For all $\delta \in \Delta_d$,
$\pi_\delta = g_{\|\delta\|_0} \binom{d}{\|\delta\|_0}^{-1}$, for a discrete
distribution $\{g_s, 0 \le s \le d\}$, for which there exist positive
universal constants $c_1, c_2, c_3 \ge c_4$ such that

(2.2)  $$\frac{c_1}{d^{c_3}} g_{s-1} \le g_s \le \frac{c_2}{d^{c_4}} g_{s-1}, \qquad s = 1, \ldots, d.$$

Remark 2. This assumption guarantees that the prior distribution
concentrates on sparse subsets of $\mathbb{R}^d$. Note that $\{g_s\}$ is
the distribution of the number of non-zero components produced by the
prior. The assumption in (2.2) guarantees that for $d$ large enough so that
$c_2/d^{c_4} < 1$, we have $g_s \le (c_2/d^{c_4})^s$, and the rate
$c_2/d^{c_4}$ gets smaller with $d$.

[14] has several examples of prior distributions that satisfy H3. For
instance if, for some hyper-parameter $u > 1$,
$q \sim \mathrm{Beta}(1, d^u)$, and given $q$, we draw independently
$\delta_j \sim \mathrm{Ber}(q)$, then the marginal distribution of $\delta$
in this case satisfies H3, with $c_1 = 1/2$, $c_2 = 1$, $c_3 = u$ and
$c_4 = u - 1$.
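
The Beta-Bernoulli example can be checked numerically. Under that prior the number of non-zero components follows a Beta-Binomial distribution, and the sketch below (with hypothetical function names, and only for the specific values of $d$ and $u$ one passes in) computes $g_s$ and tests the two-sided bound (2.2) with the constants quoted above.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_binom(n, k):
    """Logarithm of the binomial coefficient n choose k."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def sparsity_prior(d, u):
    """g_s = P(||delta||_0 = s) when q ~ Beta(1, d**u) and delta_j | q ~ Ber(q) i.i.d."""
    b = float(d) ** u
    return [math.exp(log_binom(d, s) + log_beta(s + 1, d - s + b) - log_beta(1, b))
            for s in range(d + 1)]

def check_h3(d, u):
    """Check (2.2) with c1 = 1/2, c2 = 1, c3 = u, c4 = u - 1 for all s = 1..d."""
    g = sparsity_prior(d, u)
    lower = 0.5 * d ** (-u)
    upper = d ** (-(u - 1.0))
    return all(lower * g[s - 1] <= g[s] <= upper * g[s - 1] for s in range(1, d + 1))
```

For this prior the ratio $g_s/g_{s-1}$ simplifies to $(d-s+1)/(d-s+d^u)$, which is one way to see where the constants come from.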
We study the contraction properties of $\check{\Pi}_{n,d}$ towards
$\theta_\star$. We borrow a strategy developed mostly for the analysis of
high-dimensional M-estimators, that consists in identifying a "good" subset
$\mathcal{E}_n$ of the sample space $\mathcal{Z}^{(n)}$ on which the map
$\theta \mapsto q_{n,\theta}(Z)$ has good curvature properties (see e.g.
[31] for an excellent presentation of these ideas). Using this idea, the
task at hand then boils down to controlling the probability of the set
$\mathcal{E}_n$ and showing that $\check{\Pi}_{n,d}$ has good contraction
properties when $Z \in \mathcal{E}_n$. To that end, and to shorten
notation, we introduce the (Bregman divergence) function

$$\mathcal{L}_{n,\theta}(z) \stackrel{\mathrm{def}}{=} \log q_{n,\theta}(z) - \log q_{n,\theta_\star}(z) - \langle \nabla \log q_{n,\theta_\star}(z), \theta - \theta_\star \rangle, \qquad \theta \in \mathbb{R}^d,\ z \in \mathcal{Z}^{(n)}.$$

This function plays a key role in informing on the curvature of the
objective function $\theta \mapsto \log q_{n,\theta}(Z)$ around
$\theta_\star$. However, in high-dimensional settings, it is typically not
realistic to assume that $\theta \mapsto \log q_{n,\theta}(Z)$ has good
curvature on the entire parameter space $\mathbb{R}^d$. As well explained
in [31], one should look at restrictions of $\mathcal{L}_{n,\theta}(z)$ to
interesting subsets of $\mathbb{R}^d$.

We will use a rate function to express the curvature of
$\theta \mapsto \log q_{n,\theta}(Z)$. Throughout the paper, a continuous
function $r : [0, \infty) \to [0, \infty)$ is a rate function if $r$ is
strictly increasing, $r(0) = 0$, and $\lim_{x \downarrow 0} r(x)/x = 0$.
Given a rate function $r$, and $a \ge 0$, we define

(2.3)  $$\phi_r(a) \stackrel{\mathrm{def}}{=} \inf\{x > 0 : r(z) \ge az, \text{ for all } z \ge x\},$$

with the convention that $\inf \emptyset = +\infty$. The main example of a
rate function is $r(x) = \tau x^2$, for some $\tau > 0$ (for linear
regression problems). However, the examples below are related to logistic
regression and the rate function $r(x) = \tau x^2/(1 + bx)$ is used.
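
For the two rate functions just mentioned, the quantity in (2.3) has a simple closed form. The following sketch works it out under the reading that the quadratic rate function is a positive constant `tau` times the square, and that `a > 0`; the function names are hypothetical.

```python
import math

def phi_quadratic(a, tau):
    """phi_r(a) for r(x) = tau * x**2: r(z) >= a * z holds iff z >= a / tau."""
    return a / tau

def phi_logistic(a, tau, b):
    """phi_r(a) for r(x) = tau * x**2 / (1 + b * x).

    For z > 0, r(z) >= a * z is equivalent to (tau - a * b) * z >= a, so the
    set in (2.3) is empty (and phi_r(a) = +infinity) unless tau > a * b.
    """
    if tau <= a * b:
        return math.inf
    return a / (tau - a * b)

def rate_logistic(x, tau, b):
    """The rate function r(x) = tau * x**2 / (1 + b * x)."""
    return tau * x * x / (1.0 + b * x)
```

The second case shows why finiteness of $\phi_r$ is a genuine requirement: with the logistic-type rate function, the curvature constant must dominate $ab$.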

A non-empty subset $\Theta$ of $\mathbb{R}^d$ is a cone if for all
$\lambda > 0$, and all $x \in \Theta$, $\lambda x \in \Theta$. We will say
that a cone $\Theta$ is a split cone if $u \cdot x \in \Theta$ for all
$x \in \Theta$, and all $u \in \{-1, 1\}^d$ (we recall that the notation
$u \cdot x$ denotes the component-by-component product). Split cones serve
as generalizations of sparse subsets of $\mathbb{R}^d$. The archetype
example of a split cone is the set of $s$-sparse elements:
$\{\theta \in \mathbb{R}^d : \|\theta\|_0 \le s\}$. However, in some
problems, one might have to work with sparse elements with some additional
structure, and this motivates the introduction of the split cones. A
particularly important example of a split cone is the set of elements of
$\mathbb{R}^d$ with the same sparsity structure as $\theta_\star$:

(2.4)  $$\Theta_\star \stackrel{\mathrm{def}}{=} \{\theta \in \mathbb{R}^d : \theta_j = 0 \text{ for all } j \text{ s.t. } \theta_{\star j} = 0\}.$$

Another important example of split cone that we will use is the set

$$\mathcal{N} \stackrel{\mathrm{def}}{=} \Big\{\theta \in \mathbb{R}^d : \theta \ne 0, \text{ and } \sum_{j:\ \delta_{\star j} = 0} |\theta_j| \le 7\|\theta \cdot \delta_\star\|_1 \Big\},$$

where $\delta_\star$ denotes the sparsity structure of $\theta_\star$:
$\delta_{\star j} = 1_{\{|\theta_{\star j}| > 0\}}$, $1 \le j \le d$.
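
As a small illustration of the split-cone property, the sketch below implements membership in the second cone defined above (the one involving the constant 7) and checks, on a given vector, that membership is unchanged by coordinate-wise sign flips. The function names are hypothetical and the check is for one vector at a time, not a proof.

```python
import itertools

def in_cone_N(theta, delta_star):
    """theta != 0 and sum_{j: delta_star_j = 0} |theta_j| <= 7 * ||theta . delta_star||_1."""
    if all(t == 0 for t in theta):
        return False
    off_support = sum(abs(t) for t, d in zip(theta, delta_star) if d == 0)
    on_support = sum(abs(t) for t, d in zip(theta, delta_star) if d == 1)
    return off_support <= 7 * on_support

def invariant_under_sign_flips(theta, delta_star):
    """Check that u . theta has the same membership status for every u in {-1, 1}^d."""
    base = in_cone_N(theta, delta_star)
    for u in itertools.product((-1, 1), repeat=len(theta)):
        flipped = [ui * ti for ui, ti in zip(u, theta)]
        if in_cone_N(flipped, delta_star) != base:
            return False
    return True
```

Membership only depends on the absolute values of the coordinates, which is what makes the set stable under sign flips and positive rescaling.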

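Given a rate function $r$, and a split cone
$\Theta \subseteq \mathbb{R}^d$, we set

(2.5)  $$\check{\mathcal{E}}_{n,1}(\Theta, r) \stackrel{\mathrm{def}}{=} \Big\{z \in \mathcal{Z}^{(n)} : \text{for all } \theta \in \theta_\star + \Theta,\ \mathcal{L}_{n,\theta}(z) \le -\frac{1}{2} r(\|\theta - \theta_\star\|_2)\Big\}.$$

Here as in classical Bayesian asymptotics, in order to control the
normalizing constant of the quasi-posterior distribution, we need a lower
bound on the function $\theta \mapsto \mathcal{L}_{n,\theta}(z)$. Again, a
restricted version will suffice. For $L \ge 0$, we set

(2.6)  $$\hat{\mathcal{E}}_{n,1}(\Theta, L) \stackrel{\mathrm{def}}{=} \Big\{z \in \mathcal{Z}^{(n)} : \text{for all } \theta \in \theta_\star + \Theta,\ \mathcal{L}_{n,\theta}(z) \ge -\frac{L}{2}\|\theta - \theta_\star\|_2^2\Big\}.$$

Finally, for $\lambda > 0$ we set

(2.7)  $$\mathcal{E}_{n,0}(\Theta, \lambda) \stackrel{\mathrm{def}}{=} \Big\{z \in \mathcal{Z}^{(n)} : \sup_{u \in \Theta,\ \|u\|_2 = 1} |\langle \nabla \log q_{n,\theta_\star}(z), u \rangle| \le \frac{\lambda}{2}\Big\}.$$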


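The main idea behind these definitions is that on the event
$\{Z \in \hat{\mathcal{E}}_{n,1}(\Theta, L) \cap \check{\mathcal{E}}_{n,1}(\Theta, r)\}$
the quasi-log-likelihood function $\theta \mapsto \log q_{n,\theta}(Z)$ has
very nice curvature properties when restricted to the set
$\theta_\star + \Theta$. The definition of
$\mathcal{E}_{n,0}(\Theta, \lambda)$ implies that on the event
$\{Z \in \mathcal{E}_{n,0}(\Theta, \lambda)\}$, $\theta_\star$ is close to
the maximizer of the map $\theta \mapsto \log q_{n,\theta}(Z)$. Hence the
set
$\mathcal{E}_{n,0}(\Theta, \lambda) \cap \hat{\mathcal{E}}_{n,1}(\Theta, L) \cap \check{\mathcal{E}}_{n,1}(\Theta, r)$
is our example of a "good set", and on that set, we expect
$\check{\Pi}_{n,d}(\cdot|Z)$ to have good concentration properties around
$\theta_\star$. This is the substance of the next result. Before stating
the main theorem, we introduce some more notation. For $M > 0$, let
$\mathsf{B}_d(\Theta, M) \stackrel{\mathrm{def}}{=} \{\theta \in \theta_\star + \Theta, \text{ s.t. } \|\theta - \theta_\star\|_2 \le M\}$.
For $\epsilon > 0$, let $D(\epsilon, \mathsf{B}_d(\Theta, M))$ denote the
$\epsilon$-packing number of the ball $\mathsf{B}_d(\Theta, M)$, defined as
the maximal number of points in $\mathsf{B}_d(\Theta, M)$ such that the
$\|\cdot\|_2$-distance between any pair of such points is at least
$\epsilon$.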


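Theorem 3. Assume H1-H3, and set
$s_\star \stackrel{\mathrm{def}}{=} \|\theta_\star\|_0$. Suppose that $d$ is
such that $d^{c_4} \ge 8c_2$. Let $\Theta \ni \theta_\star$ be a split cone,
$L \ge 0$, $\lambda > 0$, and a rate function $r$ be such that
$\epsilon \stackrel{\mathrm{def}}{=} \phi_r(2\lambda)$ is finite.

1. Set
$\mathcal{E}_n \stackrel{\mathrm{def}}{=} \mathcal{E}_{n,0}(\mathbb{R}^d, \rho) \cap \hat{\mathcal{E}}_{n,1}(\Theta_\star, L) \cap \check{\mathcal{E}}_{n,1}(\mathcal{N}, r)$.
Then for any integer $k \ge 0$,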

( 2 . 8 ) 


n, 


nA 


e G 




+ 2eUA + 


4L 


4c2 


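where $a \stackrel{\mathrm{def}}{=} -\frac{1}{2}\inf_{x > 0}\left[r(x) - 4\rho\sqrt{s_\star}\,x\right]$
if $\mathcal{N} \ne \emptyset$, and $a = 0$ if $\mathcal{N} = \emptyset$.


2. Set
$\mathcal{E}_n \stackrel{\mathrm{def}}{=} \mathcal{E}_{n,0}(\Theta, \lambda) \cap \hat{\mathcal{E}}_{n,1}(\Theta_\star, L) \cap \check{\mathcal{E}}_{n,1}(\Theta, r)$.
For any $M_0 > 2$,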


(2.9) 

E0) [n,4{0e0* + 0: 110-0, 


i>i 




+2 — 


\ 

Cl y 


2 > Moe} \Z)]< P0) [Z i 8n] 



where Dj D , 6^(0, (j + l)Moe)^, and where 
Co sup^ge sup„6e, ||^||^=i | {sign{u),v) |. 

Proof. See Section 5.1. □ 


Theorem 3-Part (1) shows that for $\rho$, $L$ and $r$ such that the event
$\{Z \in \mathcal{E}_{n,0}(\mathbb{R}^d, \rho) \cap \hat{\mathcal{E}}_{n,1}(\Theta_\star, L) \cap \check{\mathcal{E}}_{n,1}(\mathcal{N}, r)\}$
has high probability, one can use the second term on the right-hand side of
(2.8) to establish that the concentration of the prior on sparse subsets
(as assumed in H3) is inherited by the quasi-posterior distribution. In the
logistic regression example below,
we show that the term e^ 0 (e“* for some constant c. 

And since follows that for such models the right-side of 

(2.8) becomes small for k of the order of (c/c 4 )s*. The same is true for linear 
regression models ([13]). 

Part (2) of the theorem shows that if $\lambda$, $L$, the split cone
$\Theta$ and the rate function $r$ are well chosen such that the event
$\{Z \in \mathcal{E}_{n,0}(\Theta, \lambda) \cap \hat{\mathcal{E}}_{n,1}(\Theta_\star, L) \cap \check{\mathcal{E}}_{n,1}(\Theta, r)\}$
has high probability, then the convergence rate of the quasi-posterior
distribution is controlled mainly by the series on the right-hand side of
(2.9) and its dependence on $n, d$. In the examples below, we show how the
terms on the right-hand side of (2.9) can be handled.

We note that Part (2) of the theorem controls only the probability of
the event
$\{\theta \in \theta_\star + \Theta : \|\theta - \theta_\star\|_2 > M_0\epsilon\}$
whereas in most applications we typically want the probability of
$\{\theta \in \mathbb{R}^d : \|\theta - \theta_\star\|_2 > M_0\epsilon\}$.
As we will show in the examples below, one can use Part (1) of the theorem
to upper bound separately the probability of the event
$\{\theta \notin \theta_\star + \Theta\}$.

Finally, we point out that the upper bounds in (2.8) and (2.9) depend in
general on $\theta_\star$, typically through $L$ and the rate function $r$.
These terms essentially model the curvature of
$\theta \mapsto \log q_{n,\theta}(Z)$ around $\theta_\star$. Our setting
thus differs from the linear regression setting where the curvature of
$\theta \mapsto \log q_{n,\theta}(Z)$ is constant, and the resulting
posterior concentration bounds are uniform in $\theta_\star$ ([13] Theorem
1 and 2).


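3. Sparse Bayesian logistic regression. As a first application we
study the contraction behavior of a posterior distribution obtained from a
high-dimensional logistic regression model, for large values of the sample
size $n$ and the dimension $d$. Suppose that $Z_1, \ldots, Z_n$ are
independent 0-1 binary random variables and we consider the model

$$P(Z_i = 1) = \frac{e^{\langle x_i, \theta \rangle}}{1 + e^{\langle x_i, \theta \rangle}},$$

for a parameter $\theta \in \mathbb{R}^d$, where $x_i \in \mathbb{R}^d$ is
a known vector of covariates. Writing $z = (z_1, \ldots, z_n)$, the
likelihood function is then

$$q_{n,\theta}(z) = \exp\left(\sum_{i=1}^n z_i\langle x_i, \theta \rangle - g(\langle x_i, \theta \rangle)\right),$$

where

$$g(x) \stackrel{\mathrm{def}}{=} \log(1 + e^x), \qquad x \in \mathbb{R}.$$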
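
The following sketch evaluates the logistic log-likelihood, its gradient, and the Bregman divergence function of Section 2 for this model. It is an illustration on plain Python lists with hypothetical function names; by concavity of the log-likelihood (assumption H2), the Bregman term is never positive.

```python
import math

def g(x):
    """g(x) = log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log_q(theta, X, z):
    """Logistic log-likelihood: sum_i z_i <x_i, theta> - g(<x_i, theta>)."""
    return sum(zi * dot(xi, theta) - g(dot(xi, theta)) for xi, zi in zip(X, z))

def grad_log_q(theta, X, z):
    """Gradient: sum_i (z_i - g'(<x_i, theta>)) x_i, where g' is the sigmoid."""
    grad = [0.0] * len(theta)
    for xi, zi in zip(X, z):
        t = dot(xi, theta)
        resid = zi - 0.5 * (1.0 + math.tanh(0.5 * t))  # stable form of the sigmoid
        for j in range(len(theta)):
            grad[j] += resid * xi[j]
    return grad

def bregman(theta, theta_star, X, z):
    """log q_theta - log q_theta_star - <grad log q_theta_star, theta - theta_star>."""
    diff = [a - b for a, b in zip(theta, theta_star)]
    return (log_q(theta, X, z) - log_q(theta_star, X, z)
            - dot(grad_log_q(theta_star, X, z), diff))
```

How negative `bregman` is for `theta` away from `theta_star` is precisely the curvature that the rate function of Section 2 is meant to quantify.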

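Using the prior distribution given in (1.2)-(1.3), we consider the
posterior distribution

(3.1)  $$\check{\Pi}_{n,d}(d\theta|Z) \propto \exp\left(\sum_{i=1}^n Z_i\langle x_i, \theta \rangle - g(\langle x_i, \theta \rangle)\right) \sum_{\delta \in \Delta_d} \pi_\delta \left(\frac{\rho}{2}\right)^{\|\delta\|_0} e^{-\rho\|\theta\|_1} \mu_{d,\delta}(d\theta).$$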

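We make the following assumption that implies H1.

B1. $Z_1, \ldots, Z_n$ are independent 0-1 binary random variables, and
there exist $\theta_\star \in \mathbb{R}^d$,
$x_1, \ldots, x_n \in \mathbb{R}^d$, such that

$$P(Z_i = 1) = \frac{e^{\langle x_i, \theta_\star \rangle}}{1 + e^{\langle x_i, \theta_\star \rangle}}, \qquad i = 1, \ldots, n.$$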
Let $X \in \mathbb{R}^{n \times d}$ denote the design matrix, where the
$i$-th row of $X$ is given by the transpose of $x_i$. We shall write $g'$
and $g^{(2)}$ to denote the first and second derivatives of $g$. Let
$W \in \mathbb{R}^{n \times n}$ be the diagonal matrix with $i$-th diagonal
entry given by

$$W_i \stackrel{\mathrm{def}}{=} g^{(2)}(\langle x_i, \theta_\star \rangle), \qquad i = 1, \ldots, n.$$

We define 

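For $s \in \{1, \ldots, d\}$, we define

$$\bar{\kappa}_1(s) \stackrel{\mathrm{def}}{=} \sup\left\{\frac{\theta'(X'X)\theta}{n\|\theta\|_2^2} : 1 \le \|\theta\|_0 \le s\right\}, \quad \text{and} \quad \underline{\kappa}_1(s) \stackrel{\mathrm{def}}{=} \inf\left\{\frac{\theta'(X'WX)\theta}{n\|\theta\|_2^2} : 1 \le \|\theta\|_0 \le s\right\}.$$

We choose the regularization parameter $\rho$ in the prior distribution
(1.3) as

(3.2)  $$\rho \stackrel{\mathrm{def}}{=} 4\|X\|_\infty\sqrt{n\log(d)},$$

where $\|X\|_\infty \stackrel{\mathrm{def}}{=} \max_{i,j}|X_{ij}|$. We note
that $\bar{\kappa}_1(1) \le \|X\|_\infty^2$, and
$\underline{\kappa}_1(s) \le \bar{\kappa}_1(1)/4$, for all $s \ge 1$.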
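
As a quick worked example of the choice (3.2), the sketch below computes the regularization parameter from a design matrix stored as a list of rows, together with the largest 1-sparse eigenvalue of $X'X/n$, which reduces to the largest normalized squared column norm. The function names are hypothetical, and the design matrix used in any call is the caller's own.

```python
import math

def max_abs_entry(X):
    """||X||_inf = max over i, j of |X_ij|."""
    return max(abs(v) for row in X for v in row)

def rho_logistic(X):
    """rho = 4 * ||X||_inf * sqrt(n * log(d)), as in (3.2)."""
    n, d = len(X), len(X[0])
    return 4.0 * max_abs_entry(X) * math.sqrt(n * math.log(d))

def kappa_bar_at_one(X):
    """sup of theta'(X'X)theta / (n ||theta||_2^2) over 1-sparse theta,
    which equals max_j sum_i X_ij^2 / n."""
    n, d = len(X), len(X[0])
    return max(sum(row[j] ** 2 for row in X) / n for j in range(d))
```

Since each squared entry is at most $\|X\|_\infty^2$, the value returned by `kappa_bar_at_one` can never exceed `max_abs_entry(X) ** 2`, which is the first inequality noted above.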


Theorem 4. Assume B1 and H3. Choose $\rho$ as in (3.2). Set
$s_\star \stackrel{\mathrm{def}}{=} \|\theta_\star\|_0$,


(3.3) C =4* + - + - ( 1 + 

C4 C4 


64||X|| 


+ 


log(4e)\ 

64||X||^(log(d))2 ' log(d) ) 




+ 


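and $\bar{s} \stackrel{\mathrm{def}}{=} \lceil s_\star + \zeta \rceil$. If
$\underline{\kappa} \stackrel{\mathrm{def}}{=} \min(\underline{\kappa}_1, \underline{\kappa}_1(\bar{s})) > 0$,
then there exists a universal constant $T < \infty$ such that for all $d$
large enough, and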

2 


(3.4) 


n>T||Xr^ log(d) 


K 


the following statements hold. 

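1.  $$\mathbb{E}^{(n)}\left[\check{\Pi}_{n,d}\left(\{\theta \in \mathbb{R}^d : \|\theta\|_0 > \zeta\}\,\big|\,Z\right)\right] \le \frac{4}{d}.$$

2. There exists a finite constant $M_0 > 2$ (that depends only on the
constants in H3), such that

$$\mathbb{E}^{(n)}\left[\check{\Pi}_{n,d}\left(\left\{\theta \in \mathbb{R}^d : \|\theta - \theta_\star\|_2 > \frac{M_0\|X\|_\infty}{\underline{\kappa}_1(\bar{s})}\sqrt{\frac{\bar{s}\log(d)}{n}}\right\}\,\Big|\,Z\right)\right] \le \frac{12}{d}.$$

Proof. See Section 5.2. □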


If the dimension d is large, then 


c 


•s* H-1- 

C4 C4 



64||X||^\ 


Therefore, for design matrices $X$ for which the restricted eigenvalues
$\underline{\kappa}_1$ and $\underline{\kappa}_1(\bar{s})$ of the matrix
$n^{-1}X'WX$ are not too small, Theorem 4 implies that most of the
probability mass of the posterior distribution is on sparse subsets of
$\mathbb{R}^d$, and the rate of convergence of the posterior distribution
(3.1) is $O(\sqrt{s_\star\log(d)/n})$. The frequentist $\ell^1$-penalized
M-estimator for logistic regression has been analyzed by [31] (assuming a
random design matrix $X$), and [25] (assuming a deterministic design matrix
$X$), and is known to converge at the same rate, and under assumptions that
are similar to those imposed above. Technically, our approach is closer to
[25]. The approach of [31] leads to slightly better conditions on the
sample size $n$ (they require $n$ to increase linearly in $s_\star$, not
quadratically, as in (3.4)), at the expense of more structure on the design
matrix ($X$ is assumed to have i.i.d. rows from a sub-Gaussian distribution
and positive definite covariance).


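Remark 5. As pointed out by a referee, one can use the convergence
rate in Theorem 4-Part (2) with an argument used in [14] Theorem 2.2 to
derive a bound on the convergence rate in the $\ell^q$-norm for
$q \in (0, 2]$:

$$\mathbb{E}^{(n)}\left[\check{\Pi}_{n,d}\left(\left\{\theta \in \mathbb{R}^d : \|\theta - \theta_\star\|_q > \frac{M_0\|X\|_\infty\,\bar{s}^{1/q}}{\underline{\kappa}_1(\bar{s})}\sqrt{\frac{\log(d)}{n}}\right\}\,\Big|\,Z\right)\right] \le \frac{16}{d}.$$

This follows from the fact that for any $r > 0$,

$$\{\theta \in \mathbb{R}^d : \|\theta - \theta_\star\|_q > r\} \subseteq \{\theta \in \mathbb{R}^d : \|\theta - \theta_\star\|_q > r,\ \|\theta\|_0 \le \zeta\} \cup \{\theta \in \mathbb{R}^d : \|\theta\|_0 > \zeta\},$$

and by Hölder's inequality, for $\theta \in \mathbb{R}^d$ such that
$\|\theta\|_0 \le \zeta$, $\|\theta - \theta_\star\|_0 \le \bar{s}$, and

$$\|\theta - \theta_\star\|_q \le \bar{s}^{\frac{1}{q} - \frac{1}{2}}\|\theta - \theta_\star\|_2.$$

Obviously, the same argument can be used with respect to the general bound
in Theorem 3, but the resulting bound would be more complicated.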
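
The Hölder step in Remark 5 compares the $\ell^q$ and $\ell^2$ norms of a sparse vector. The sketch below (with hypothetical function names) computes both sides of the inequality $\|u\|_q \le \|u\|_0^{1/q - 1/2}\|u\|_2$ for $q \in (0, 2]$, so that it can be checked on any given vector.

```python
def norm_q(u, q):
    """(sum_j |u_j|^q)^(1/q); a norm for q >= 1 and a quasi-norm for 0 < q < 1."""
    return sum(abs(v) ** q for v in u) ** (1.0 / q)

def num_nonzero(u):
    """||u||_0, the number of non-zero components."""
    return sum(1 for v in u if v != 0)

def holder_bound(u, q):
    """||u||_0^(1/q - 1/2) * ||u||_2, the upper bound on ||u||_q for 0 < q <= 2."""
    return num_nonzero(u) ** (1.0 / q - 0.5) * norm_q(u, 2.0)
```

With $u = \theta - \theta_\star$ and $\|u\|_0 \le \bar{s}$, this is the inequality that turns the $\ell^2$ rate of Theorem 4 into the $\ell^q$ rate of the remark.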

Remark 6. It is interesting to observe that the contraction result given
in Theorem 4 Part (2) holds, not in spite of the large dimension $d$, but
because $d$ is large. In other words, the result should be viewed as a form
of concentration of measure phenomenon for $\check{\Pi}_{n,d}$ as
$d \to \infty$. In particular, Theorem 4 should not be applied to a
fixed-dimension case in an attempt to recover standard Bayesian contraction
results (fixed $d$, $n \to \infty$). Indeed, note that for $d$ fixed, the
prior distribution $\Pi$ in (1.2)-(1.3) with $\rho$ as in (3.2) converges
weakly to a point-mass at 0 as $n \to \infty$, which is not a good behavior
of a prior in fixed-dimensional settings. However, with more appropriate
prior assumptions, the argument in the proof of Theorem 3 can be easily
modified to derive convergence rate results that would be applicable to the
fixed-dimensional setting. We refer to [20] (and the references therein)
for a good presentation of finite-dimensional Bayesian asymptotics.

4. Quasi-Bayesian inference of large binary graphical models.

As another example, we consider the Bayesian analysis of high-dimensional
binary graphical models (sometimes called Ising models). Let
$\mathcal{M}_p$ be the space of real-valued $p \times p$ symmetric
matrices. For $\theta \in \mathcal{M}_p$, let $f_\theta$ be the probability
mass function defined on $\{0,1\}^p$ by

(4.1)  $$f_\theta(x_1, \ldots, x_p) = \frac{1}{Z_\theta}\exp\left(\sum_{j=1}^p \theta_{jj}x_j + \sum_{i<j}\theta_{ij}x_i x_j\right), \qquad x_j \in \{0,1\},\ 1 \le j \le p,$$

where $Z_\theta$ is the normalizing constant. We consider the problem of
estimating $\theta$ under a sparsity assumption, from a matrix
$Z \in \mathbb{R}^{n \times p}$ where each row of $Z$ is an independent
realization from $f_{\theta_\star}$ for some sparse
$\theta_\star \in \mathcal{M}_p$. This problem has generated some
literature in recent years ([8, 21, 32, 4, 10] and the references therein),
all in the frequentist framework.

The Bayesian estimation of $\theta$ is significantly more challenging
because the normalizing constants $Z_\theta$ are typically intractable, and
this leads to posterior distributions that are doubly intractable. In the
frequentist literature cited above, the preferred approach for estimating
$\theta$ is via penalized pseudo-likelihood maximization, which nicely
side-steps the intractable normalizing constants issue. The quasi-Bayesian
framework developed in this work can be used to combine these
pseudo-likelihood functions with a prior distribution to produce
quasi-Bayesian posterior distributions.

The most commonly used pseudo-likelihood function is obtained by taking
the product of all the conditional densities in (4.1). This is an idea that
goes back at least to [12]. Combined with a prior distribution $\Pi$ on
$\mathcal{M}_p$, this approach readily yields a quasi-posterior
distribution on $\mathcal{M}_p$ that falls in the framework presented
above. Note however that when $p$ is large, say $p \ge 500$, the space
$\mathcal{M}_p$ has dimension bigger than $10^5$, and MCMC sampling from
this quasi-posterior distribution becomes a daunting and time-consuming
task. One interesting idea is to break the symmetry and to consider the
quasi-likelihood


(4.2)  $$q_{n,\theta}(Z) = \prod_{j=1}^p \prod_{i=1}^n \frac{\exp\left(Z_{ij}\left(\theta_{jj} + \sum_{k \ne j}\theta_{kj}Z_{ik}\right)\right)}{1 + \exp\left(\theta_{jj} + \sum_{k \ne j}\theta_{kj}Z_{ik}\right)}.$$

Notice that the only difference between the pseudo-likelihood described
above and (4.2) is that the symmetry constraint in $\theta$ is relaxed,
that is, the parameter space of the map $\theta \mapsto q_{n,\theta}(Z)$ is
$\mathbb{R}^{p \times p}$, not $\mathcal{M}_p$. However, this difference
has a huge impact since now $q_{n,\theta}(Z)$ factorizes along the columns
of $\theta$. As a result, maximizing a penalized version of (4.2) is
equivalent to solving $p$ independent logistic regressions (assuming a
separable penalty), and this can be done efficiently in a parallel
computing environment. This pseudo-likelihood approach was popularized by
the influential paper [29] in the Gaussian case, and extended to the Ising
model by [32]. In a recent work ([6]), the author extended this idea to the
Bayesian analysis of large Gaussian graphical models, and analyzed the
contraction of the resulting quasi-posterior distribution using Theorem 3.
Here we extend the method to the Ising model.
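
The column-wise factorization can be made explicit in code. The sketch below evaluates the logarithm of (4.2) as a sum of $p$ terms, where the $j$-th term only involves the $j$-th column of $\theta$. Matrices are plain lists of rows and the function names are hypothetical.

```python
import math

def softplus(x):
    """log(1 + exp(x)), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def column_log_q(j, col, Z):
    """Log of the j-th factor of (4.2); col is the j-th column of theta.

    This is a logistic log-likelihood with response Z[.][j] and intercept col[j].
    """
    total = 0.0
    for row in Z:
        eta = col[j] + sum(col[k] * row[k] for k in range(len(col)) if k != j)
        total += row[j] * eta - softplus(eta)
    return total

def log_q(theta, Z):
    """Log of (4.2): the sum over columns of the column-wise terms."""
    p = len(theta)
    columns = [[theta[k][j] for k in range(p)] for j in range(p)]
    return sum(column_log_q(j, columns[j], Z) for j in range(p))
```

Because each term depends on a single column, the $p$ terms can be handled independently, which is what makes the parallel treatment mentioned above possible.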

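Throughout this section, if $\theta \in \mathbb{R}^{p \times p}$,
$\theta_{\cdot j} \in \mathbb{R}^p$ denotes the $j$-th column of $\theta$.
In view of the discussion above, and for a discrete probability
distribution $\{\pi_\delta, \delta \in \Delta_p\}$ on $\Delta_p$, and
$\rho > 0$, we consider the quasi-posterior $\check{\Pi}_{n,d}$ on
$\mathbb{R}^{p \times p}$ given by

(4.3)  $$\check{\Pi}_{n,d}(d\theta|Z) \propto q_{n,\theta}(Z)\prod_{j=1}^p \sum_{\delta \in \Delta_p} \pi_\delta \left(\frac{\rho}{2}\right)^{\|\delta\|_0} e^{-\rho\|\theta_{\cdot j}\|_1}\mu_{p,\delta}(d\theta_{\cdot j}) = \prod_{j=1}^p \check{\Pi}_{n,d,j}(d\theta_{\cdot j}|Z),$$

where $\check{\Pi}_{n,d,j}(\cdot|Z)$ is the probability measure on
$\mathbb{R}^p$ given by

$$\check{\Pi}_{n,d,j}(du|Z) \propto \prod_{i=1}^n \frac{\exp\left(Z_{ij}\left(u_j + \sum_{k \ne j}u_k Z_{ik}\right)\right)}{1 + \exp\left(u_j + \sum_{k \ne j}u_k Z_{ik}\right)} \times \sum_{\delta \in \Delta_p} \pi_\delta \left(\frac{\rho}{2}\right)^{\|\delta\|_0} e^{-\rho\|u\|_1}\mu_{p,\delta}(du).$$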

Remark 7. One of the limitations of the approach is that the
distribution $\check{\Pi}_{n,d}$ does not necessarily produce symmetric
matrices. However, because of the contraction properties discussed below,
typical realizations of $\check{\Pi}_{n,d}$ will be close to being
symmetric. Furthermore, from a practical viewpoint, one can easily remedy a
broken symmetry using various symmetrization rules as suggested for
instance in [29].
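
Two simple symmetrization rules are sketched below for illustration. They are generic post-processing choices (averaging, and an "and"-type rule that keeps the entry of smaller magnitude), given here as assumptions; they are not claimed to be the specific rules of [29], and the function names are hypothetical.

```python
def symmetrize_average(theta):
    """Replace theta by (theta + theta') / 2."""
    p = len(theta)
    return [[0.5 * (theta[i][j] + theta[j][i]) for j in range(p)] for i in range(p)]

def symmetrize_min_abs(theta):
    """For each pair (i, j), keep the entry of smaller absolute value, so that an
    off-diagonal entry is non-zero only if both columns select it."""
    p = len(theta)
    out = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            # Ties in absolute value are broken by the signed value, which keeps
            # the result symmetric even when theta[i][j] == -theta[j][i].
            out[i][j] = min(theta[i][j], theta[j][i], key=lambda v: (abs(v), v))
    return out

def is_symmetric(theta):
    p = len(theta)
    return all(theta[i][j] == theta[j][i] for i in range(p) for j in range(p))
```

Either rule would be applied to a draw (or to a point estimate) from the quasi-posterior, after sampling.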

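We make the following assumptions.

C1. The rows of $Z \in \mathbb{R}^{n \times p}$ are independent
$\{0,1\}^p$-valued random variables with common probability mass function
$f_{\theta_\star}$, for some $\theta_\star \in \mathcal{M}_p$.

We define

$$s_{\star j} \stackrel{\mathrm{def}}{=} \|\theta_{\star \cdot j}\|_0, \quad \text{and} \quad s_\star \stackrel{\mathrm{def}}{=} \max_{1 \le j \le p} s_{\star j}.$$

Hence $s_\star$ is the maximum degree of the undirected graph encoded by
$\theta_\star$. The sparsity structure of $\theta_\star$ is the matrix
$\delta_\star \in \{0,1\}^{p \times p}$ defined as
$\delta_{\star jk} = 1_{\{|\theta_{\star jk}| > 0\}}$. For
$X \sim f_{\theta_\star}$, and $1 \le j \le p$, we define

$$X^{(j)} \stackrel{\mathrm{def}}{=} (X_1, \ldots, X_{j-1}, 1, X_{j+1}, \ldots, X_p) \in \mathbb{R}^p$$

(viewed as a column vector), and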


^0 ) jg 




(i) 


We set 


(4.4) 

i<j<p 




u 


l^lli 


, u G \ {0}, ||u||o < s 


and 


def . r ■ r ) u'TL^^^U 

Kn = mt mf < — - —tt;—, 
l<j<p 1 ||u||2 


u 


eK-\{0), ^ 7 ^ ^ \uk\ 

^*kj — 0 
Remark 8. It is easy to verify that


V( 2 ) log 


n 


exp (^Zij (uj + UkZik^^ 

1 + exp (uj + Ykj^j UkZik^ 


i=l 


where $Z_i^{(j)} = (Z_{i1}, \ldots, Z_{i,j-1}, 1, Z_{i,j+1}, \ldots, Z_{ip})$.
Hence $\mathcal{H}^{(j)}$ is the Fisher information matrix in the
conditional model that regresses the $j$-th column of $Z$ on the remaining
ones. The quantities $\underline{\kappa}_2(s)$ and $\underline{\kappa}_2$
are (the minimum over $j$ of) restricted smallest eigenvalues of these
information matrices. We will work under the assumption that
$\underline{\kappa}_2(\bar{s}) > 0$ and $\underline{\kappa}_2 > 0$ for some
well-chosen $\bar{s}$. Similar assumptions are made in most works on
high-dimensional discrete graphical models ([32, 4, 10]). Although these
assumptions are very natural in this context, to the best of our knowledge
there does not seem to exist any easy way of checking them for a given
parameter value $\theta_\star$.


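We will take the prior parameter $\rho$ as

(4.5)  $$\rho = 24\sqrt{n\log(p)}.$$

In order to apply Theorem 3 we view $\mathbb{R}^{p \times p}$ as
$\mathbb{R}^d$, with $d = p^2$, equipped with the Frobenius norm
$\|\theta\|_F \stackrel{\mathrm{def}}{=} \sqrt{\mathrm{Tr}(\theta'\theta)}$,
and inner product
$\langle \theta, \vartheta \rangle_F \stackrel{\mathrm{def}}{=} \mathrm{Tr}(\theta'\vartheta)$,
where $\mathrm{Tr}(\theta)$ denotes the trace of the matrix $\theta$.
Throughout this section, the norm $\|\cdot\|_2$ always denotes the
Euclidean norm on $\mathbb{R}^p$. We will work with split cones of the form
$\{\theta \in \mathbb{R}^{p \times p} : \|\theta_{\cdot j}\|_0 \le s_j,\ 1 \le j \le p\}$.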


Theorem 9. Consider the quasi-posterior distribution (4.3). Suppose
that C1 holds, the prior $\{\pi_\delta, \delta \in \Delta_p\}$ satisfies H3
(with $d$ replaced by $p$), and $\rho$ is given by (4.5). For
$1 \le j \le p$ set

(4.6) 

def 4: 

2 / 128 

0 = 'S*i 4- 

+ — 1 + - + 

C4 

C 4 V A2 

_ def 
= 

[-1- Cj], and s 

def _ 

— j <c.p ^ j • 


^*3 




*31 


K 2 64(log(p))2 log(p) 

If $\underline{\kappa} \stackrel{\mathrm{def}}{=} \min(\underline{\kappa}_2, \underline{\kappa}_2(\bar{s})) > 0$,
then there exist universal finite positive constants $A_1$, $A_2$ such that
for all $p$ large enough and


(4.7) 


I jE 


log(p), 


■ i=i 


the following statements hold. 
1.


eW [n„,d ({0 G : ||0.,||o > 0, for some j} \Z)] < 6"^=' 


P 


2. There exists a finite constant $M_0 > 2$ (that depends on the constants
in H3), such that


E(«) 


- 1 

< 

( 

^n^d 

< 

1 - 

1 


0 G 


Dfix d 


■■ \\O-0*\\f> 


Mq 

«2(s) 


\ 


Proof. See Section 5.3. 


E' 

0=1 


log(p) 


n 


1^ 


< 2e-"^2n ^ 


12 
P ' 


□ 


If $p$ and $n$ are large while $\kappa$ remains bounded away from zero, Theorem
9-Part(1) implies that the quasi-posterior distribution $\Pi_{n,d}$ puts high probability
on matrices of $\mathbb{R}^{p\times p}$ with the same sparsity pattern as $\theta_\star$, and Theorem
9-Part(2) implies that in this case, the rate of convergence in the Frobenius
norm is of order

$$O\left(\sqrt{\frac{(p+S)\log(p)}{n}}\right),$$

where $S = \sum_{j=1}^p s_{\star j}$ is twice the number of non-zero components of $\theta_\star$. As
we show next, a faster convergence rate is possible if one is only interested in
components of $\theta$. We consider the norm

$$|||\theta||| \overset{\mathrm{def}}{=} \max_{1\le j\le p}\|\theta_{\cdot j}\|_2,\qquad \theta\in\mathbb{R}^{p\times p}.$$

Theorem 10. Under the assumptions of Theorem 9, if $\kappa > 0$, then there
exist finite universal constants $A_1, A_2$, and a finite constant $M_0 > 2$ (that
depends only on the constants in H3) such that for all $p$ large enough, and
for


n > Ai 




n 


n,d 


9 G 


r)dxd 


- 0*1 > 


k{s) 
Mq 


A2('S) 


log(p), 

log(p) 1 

n j 


Z 


< 2e 


—A2n 


+ - 


Proof. See Section 5.3. 


12 

P ’ 

□ 
5. Proofs.

5.1. Proof of Theorem 3. To improve readability we split the proof into
three parts. The first part deals with the normalizing constant of the
quasi-posterior distribution, the second part deals with the existence of test
functions, and the proof of the theorem itself is given in the third part.

5.1.1. On the normalizing constant of the quasi-posterior distribution.
The next lemma provides a lower bound on the normalizing constant of the
quasi-posterior distribution (2.1), following an approach initially developed
by [13].


Lemma 11. Assume H1-H2. Fix T > 0, and a split cone 0 D 0*. For 
all z G T„q(0, L), 



(5.1) 


Proof. Using the definition of the prior $\Pi$, we have



(5.2) 


For 2 G Tn,i(0,T), and 0 G 0* -|- 0* C 0,^ -|- 0, 



Setting 3} = Vlog 0 ^( 2 ), (5.2) then gives 



L 



X 


e 


0 *+ 0 * 


We note that the support of the measure Pdfii, is 0* = 0* + 0*- Using 
this and the change of variable $\theta = \theta_\star + z$, we see that the integral on the
right-hand side of (5.3) is

By Jensen's inequality,


/ 






Ir'^ ® 






> exp / (i9, z) 


fidA (d^) 


= 1- 


4.e-2ll“ll2-pll-lli^,^,^(du) 

Using this, and going back to (5.2) we conclude that 

7 r <^ (InfiA^) '^2/ jRd 

Now, note that 

It is easy to calculate that for $a > 0$, $b > 0$


/ 

JR' 



where $\varphi$ is the density of the standard normal distribution, and $\Phi$ its cdf. The
formula continues to hold by continuity at $a = 0$. The ratio $(1-\Phi(z))/\varphi(z)$
(known as Mills' ratio), satisfies

(5.5) $$\frac{2}{z+\sqrt{z^2+4}} \le \frac{1-\Phi(z)}{\varphi(z)} \le \frac{4}{3z+\sqrt{z^2+8}},\qquad z\ge 0,$$

see for instance [11] Theorem 2.3 for a proof. We use this inequality and
(5.4) to conclude that



2^ 


dz > 


2p 

L + p^^ 


and the lemma follows easily. 


□ 
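As an illustrative aside (not part of the original proof), the two-sided bound on Mills' ratio used above, namely $2/(z+\sqrt{z^2+4}) \le (1-\Phi(z))/\varphi(z) \le 4/(3z+\sqrt{z^2+8})$ for $z \ge 0$, can be checked numerically. The sketch below uses only the Python standard library; the helper names and the grid are hypothetical choices made for illustration.

```python
import math

def mills_ratio(z):
    """(1 - Phi(z)) / phi(z) for the standard normal, computed via erfc."""
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))                    # 1 - Phi(z)
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return tail / density

def mills_bounds(z):
    """Lower and upper bounds on Mills' ratio for z >= 0."""
    lower = 2.0 / (z + math.sqrt(z * z + 4.0))
    upper = 4.0 / (3.0 * z + math.sqrt(z * z + 8.0))
    return lower, upper

def bounds_hold(z):
    lower, upper = mills_bounds(z)
    return lower <= mills_ratio(z) <= upper

# Check the inequality on a grid of points in [0, 6].
grid = [0.05 * k for k in range(121)]
all_ok = all(bounds_hold(z) for z in grid)
```

At $z = 0$ the ratio equals $\sqrt{\pi/2}$, which lies between the lower bound $1$ and the upper bound $\sqrt{2}$.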


5.1.2. On the existence of test functions. In this paragraph we establish
the existence of test functions to test the density $f_{n,\theta_\star}$ against some
misspecified alternatives $Q_{n,\theta}$ defined below. The result is based on Lemma 6.1
of [23], that we shall recall first for completeness. For any two integrable
non-negative functions $q_1, q_2$ on $\mathcal{Z}^{(n)}$ and for $\alpha\in(0,1)$, the Hellinger transform
$\mathcal{H}_\alpha(q_1,q_2)$ is defined as

$$\mathcal{H}_\alpha(q_1,q_2) = \int_{\mathcal{Z}^{(n)}} q_1^{\alpha}(z)\,q_2^{1-\alpha}(z)\,dz.$$

Here we work with the case $\alpha = 1/2$, and set $\mathcal{H}(q_1,q_2) \overset{\mathrm{def}}{=} \mathcal{H}_{1/2}(q_1,q_2)$.
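For intuition only, the Hellinger transform can be computed explicitly when the sample space is finite and the reference measure is the counting measure, so that the integral becomes a sum. The following sketch makes that simplifying assumption; the function name and the example vectors are hypothetical.

```python
def hellinger_transform(q1, q2, alpha=0.5):
    """Sum of q1^alpha * q2^(1 - alpha) over a finite sample space.

    q1 and q2 are sequences of non-negative numbers of equal length,
    playing the role of densities with respect to the counting measure.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if len(q1) != len(q2):
        raise ValueError("q1 and q2 must have the same length")
    return sum((a ** alpha) * (b ** (1.0 - alpha)) for a, b in zip(q1, q2))

# Two hypothetical probability vectors on a three-point space.
p = [0.2, 0.3, 0.5]
q = [0.5, 0.3, 0.2]
affinity = hellinger_transform(p, q)  # the case alpha = 1/2
```

For two probability vectors the value at $\alpha = 1/2$ is at most 1 by the Cauchy-Schwarz inequality, with equality when the two vectors coincide; small values indicate that the two densities are easy to tell apart, which is what makes the transform useful for bounding testing errors.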


Lemma 12 ([23] Lemma 6.1). Let $p$ be a probability density function on
$\mathcal{Z}^{(n)}$ and $\mathcal{Q}$ a class of non-negative integrable functions on $\mathcal{Z}^{(n)}$. Then

(5.6) $$\inf_{\phi}\ \sup_{q\in\mathcal{Q}}\left[\int_{\mathcal{Z}^{(n)}}\phi(z)p(z)\,dz + \int_{\mathcal{Z}^{(n)}}(1-\phi(z))q(z)\,dz\right] \le \sup_{q\in\mathrm{conv}(\mathcal{Q})}\mathcal{H}(p,q),$$

where $\mathrm{conv}(\mathcal{Q})$ is the convex hull of $\mathcal{Q}$, and the infimum in (5.6) is taken
over all test functions, that is all measurable functions $\phi:\mathcal{Z}^{(n)}\to[0,1]$.
Furthermore, there exists a test function $\phi$ that attains the infimum.


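To derive the test function for our quasi-likelihood setting, we will also
need the following easy result.


Lemma 13. Fix $\lambda > 0$, a split cone $\bar\Theta$, and a rate function $r$ such that
$\phi_r(2\lambda)$ is finite. For any $\theta\in\theta_\star+\bar\Theta$ such that $\|\theta-\theta_\star\|_2 > \phi_r(2\lambda)$, we have

$$\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)} \le e^{-\frac14 r(\|\theta-\theta_\star\|_2)},\qquad z\in\mathcal{E}_{n,0}(\bar\Theta,\lambda)\cap\check{\mathcal{E}}_{n,1}(\bar\Theta,r).$$


Proof. For all $z\in\mathcal{Z}^{(n)}$ and $\theta\in\mathbb{R}^d$ we have

$$\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)} = \exp\left[\langle\nabla\log q_{n,\theta_\star}(z),\theta-\theta_\star\rangle + \mathcal{L}_{n,\theta}(z)\right].$$

By the definition of $\check{\mathcal{E}}_{n,1}(\bar\Theta,r)$, for $\theta\in\theta_\star+\bar\Theta$ and $z\in\check{\mathcal{E}}_{n,1}(\bar\Theta,r)$, we
have $\mathcal{L}_{n,\theta}(z) \le -\frac12 r(\|\theta-\theta_\star\|_2)$. And by the definition of $\mathcal{E}_{n,0}(\bar\Theta,\lambda)$, for
$z\in\mathcal{E}_{n,0}(\bar\Theta,\lambda)$, and $\theta\in\theta_\star+\bar\Theta$, we have

$$|\langle\nabla\log q_{n,\theta_\star}(z),\theta-\theta_\star\rangle| \le \frac{\lambda}{2}\|\theta-\theta_\star\|_2.$$

Hence, for $z\in\mathcal{E}_{n,0}(\bar\Theta,\lambda)\cap\check{\mathcal{E}}_{n,1}(\bar\Theta,r)$, and $\theta\in\theta_\star+\bar\Theta$,

(5.7) $$\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)} \le \exp\left[\frac{\lambda}{2}\|\theta-\theta_\star\|_2 - \frac12 r(\|\theta-\theta_\star\|_2)\right].$$

If in addition $\|\theta-\theta_\star\|_2 > \phi_r(2\lambda)$, then from the properties of the rate
function $r$, we have $2\lambda\|\theta-\theta_\star\|_2 - r(\|\theta-\theta_\star\|_2) \le 0$, and the result follows. □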
Our main result on the existence of test functions follows. We recall that
for $M > 0$, and a split cone $\bar\Theta$, $\mathsf{B}(\bar\Theta, M) \overset{\mathrm{def}}{=} \{\theta\in\theta_\star+\bar\Theta:\ \|\theta-\theta_\star\|_2 \le M\}$,
and for $\epsilon > 0$, $\mathsf{D}(\epsilon, \mathsf{B}(\bar\Theta, M))$ denotes the $\epsilon$-packing number of $\mathsf{B}(\bar\Theta, M)$
in the norm $\|\cdot\|_2$.


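Lemma 14. Fix $\lambda > 0$, a split cone $\bar\Theta$, and a rate function $r$ such that
$\epsilon \overset{\mathrm{def}}{=} \phi_r(2\lambda)$ is finite. Set $\mathcal{E}_n \overset{\mathrm{def}}{=} \mathcal{E}_{n,0}(\bar\Theta,\lambda)\cap\check{\mathcal{E}}_{n,1}(\bar\Theta,r)$. For $\theta\in\mathbb{R}^d$ define the
function

(5.8) $$Q_{n,\theta}(z) \overset{\mathrm{def}}{=} \mathbf{1}_{\mathcal{E}_n}(z)\,\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)}\,f_{n,\theta_\star}(z),\qquad z\in\mathcal{Z}^{(n)}.$$

For any $M > 2$, there exists a measurable function $\phi:\mathcal{Z}^{(n)}\to[0,1]$ such
that

$$\mathbb{E}^{(n)}[\phi(Z)] \le \sum_{j\ge1} D_j\, e^{-\frac18 r\left(\frac{jM\epsilon}{2}\right)},$$

where $D_j \overset{\mathrm{def}}{=} \mathsf{D}\left(\frac{jM\epsilon}{2}, \mathsf{B}(\bar\Theta,(j+1)M\epsilon)\right)$. Furthermore, for all $j\ge1$, all
$\theta\in\theta_\star+\bar\Theta$ such that $\|\theta-\theta_\star\|_2 > jM\epsilon$,

$$\int_{\mathcal{Z}^{(n)}}(1-\phi(z))\,Q_{n,\theta}(z)\,dz \le e^{-\frac18 r\left(\frac{jM\epsilon}{2}\right)}.$$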

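Proof. First, notice that the function $z\mapsto Q_{n,\theta}(z)$ is integrable for all
$\theta\in\theta_\star+\bar\Theta$. Indeed, using (5.7) for any such $\theta$, and for $z\in\mathcal{E}_n$, $\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)} \le \exp\left(\frac{\lambda}{2}\|\theta-\theta_\star\|_2\right)$. Hence,

$$\int_{\mathcal{Z}^{(n)}} Q_{n,\theta}(z)\,dz = \int_{\mathcal{E}_n}\frac{q_{n,\theta}(z)}{q_{n,\theta_\star}(z)}\,f_{n,\theta_\star}(z)\,dz \le e^{\frac{\lambda}{2}\|\theta-\theta_\star\|_2}.$$

Now, fix $\bar\epsilon > 2\epsilon$ (where $\epsilon = \phi_r(2\lambda)$), and fix $\theta\in\theta_\star+\bar\Theta$ such that
$\|\theta-\theta_\star\|_2 > \bar\epsilon$. Set $\mathcal{P}_\theta \overset{\mathrm{def}}{=} \{Q_{n,u}:\ u\in\theta_\star+\bar\Theta \text{ and } \|u-\theta\|_2 \le \bar\epsilon/2\}$, and let
$\mathrm{conv}(\mathcal{P}_\theta)$ denote the convex hull of the set $\mathcal{P}_\theta$. By Lemma 12 applied with
$p = f_{n,\theta_\star}$, and $\mathcal{Q} = \mathcal{P}_\theta$, there exists a measurable function $\phi_\theta:\mathcal{Z}^{(n)}\to[0,1]$
such that

(5.9) $$\mathbb{E}^{(n)}[\phi_\theta(Z)] \le \sup_{Q\in\mathrm{conv}(\mathcal{P}_\theta)}\mathcal{H}(f_{n,\theta_\star},Q)\quad\text{and}\quad \sup_{Q\in\mathcal{P}_\theta}\int_{\mathcal{Z}^{(n)}}(1-\phi_\theta(z))Q(z)\,dz \le \sup_{Q\in\mathrm{conv}(\mathcal{P}_\theta)}\mathcal{H}(f_{n,\theta_\star},Q).$$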
Any $Q\in\mathrm{conv}(\mathcal{P}_\theta)$ can be written as a finite convex combination $Q =
\sum_j\alpha_j Q_{n,u_j}$ where $\alpha_j\ge0$, $\sum_j\alpha_j = 1$, $u_j\in\theta_\star+\bar\Theta$, and $\|u_j-\theta\|_2 \le \bar\epsilon/2$.

However, since $\|\theta-\theta_\star\|_2 > \bar\epsilon$, and $\|u_j-\theta\|_2 \le \bar\epsilon/2$, we see that $\|u_j-\theta_\star\|_2 >
\bar\epsilon/2 > \epsilon$. Hence, using Lemma 13 and the definition of the Hellinger
transform, we have


nfn,e.,Q) 



Qn,Uj {z) 

Qn,eA^) 


fn,eAz)dz < 



a^e 


-ir(\\uj-e4^) 


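Hence (5.9) becomes

(5.10) $$\mathbb{E}^{(n)}[\phi_\theta(Z)] \le e^{-\frac18 r\left(\frac{\bar\epsilon}{2}\right)},\quad\text{and}\quad \sup_{Q\in\mathcal{P}_\theta}\int_{\mathcal{Z}^{(n)}}(1-\phi_\theta(z))Q(z)\,dz \le e^{-\frac18 r\left(\frac{\bar\epsilon}{2}\right)}.$$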


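Now, given $M > 2$, we write $\{\theta\in\theta_\star+\bar\Theta:\ \|\theta-\theta_\star\|_2 > M\epsilon\} = \cup_{j\ge1}B(j)$,
where

$$B(j) = \{\theta\in\theta_\star+\bar\Theta \text{ s.t. } jM\epsilon < \|\theta-\theta_\star\|_2 \le (j+1)M\epsilon\}.$$

For each $j\ge1$, let $S_j$ be a maximal set of $(jM\epsilon/2)$-separated points in $B(j)$. For
each $j$ for which $B(j)\neq\emptyset$, and each point $\theta_{jk}\in S_j$ we can construct a test
function $\phi_{\theta_{jk}}$ as above, with $\bar\epsilon = jM\epsilon$. Then we set

$$\phi = \sup_{j}\ \max_{\theta_{jk}\in S_j}\phi_{\theta_{jk}},$$

where the supremum in $j$ is over the indexes for which $B(j)\neq\emptyset$. Now, any
$\theta\in\theta_\star+\bar\Theta$ such that $\|\theta-\theta_\star\|_2 > jM\epsilon$ will be within $iM\epsilon/2$ of a point $\theta_{ik}$ in
$S_i$ for some $i\ge j$. Hence by (5.10), for any such $\theta$,

$$\int(1-\phi(z))Q_{n,\theta}(z)\,dz \le \int(1-\phi_{\theta_{ik}}(z))Q_{n,\theta}(z)\,dz \le e^{-\frac18 r\left(\frac{iM\epsilon}{2}\right)} \le e^{-\frac18 r\left(\frac{jM\epsilon}{2}\right)}.$$

Notice that the size of $S_j$ is upper bounded by $D_j$. Using this and (5.10), we
get

$$\mathbb{E}^{(n)}[\phi(Z)] \le \sum_{j\ge1}D_j\,e^{-\frac18 r\left(\frac{jM\epsilon}{2}\right)},$$

which proves the lemma.

□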
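The construction above relies on maximal separated sets: a set of points that are pairwise more than a given distance apart and to which no further point can be added. As a purely illustrative sketch (the proof only needs existence, and the greedy procedure below is an assumption of this example rather than the paper's argument), such a set can be built greedily from a finite collection of candidate points. Maximality is what guarantees that every candidate lies within the separation distance of a selected point, which is the covering property used in the proof.

```python
import math

def greedy_separated_set(points, eps):
    """Greedily select a maximal eps-separated subset of a finite point set.

    A point is kept only if it is at distance > eps from every point kept
    so far, so every rejected point is within eps of some kept point.
    """
    centers = []
    for x in points:
        if all(math.dist(x, c) > eps for c in centers):
            centers.append(x)
    return centers

# Hypothetical candidate points: an 11 x 11 grid on the unit square.
candidates = [(i / 10.0, j / 10.0) for i in range(11) for j in range(11)]
eps = 0.25
centers = greedy_separated_set(candidates, eps)
```

The number of selected points is bounded by the packing number at scale `eps`, which mirrors the role of $D_j$ in the bound on the size of $S_j$.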
5.1.3. Proof of Theorem 3-Part(1). For integer $k\ge0$, let $A_k \overset{\mathrm{def}}{=} \{\theta\in\mathbb{R}^d:\ \|\theta\|_0 \ge s_\star + k\}$. We have

eW [Un^MklZ)) < (Z (f £n) + T, 


where T = E^”) 


j InA^nidd) 

UAZ) , 


theorem to write 
1 




. We use Lemma 11, and Fubini’s 


T < 


^<5* 

1 


\ P J V JAk QnMZ) 


(5.11) 


— - ( 1 + 

^5* V 

X TT^ 

<5eAd 


L 




' Ak 




qn,e{Z) 

qnfiAZ) e-p\\^*h 


Pd,5{^0). 


We need to control the expectation on the right-side of (5.11). First note 
that f„^o(I^‘^)P) = {z ^ : ||Vlog(7n^e^(2;)||oo < f}- With this in mind, 

we see that z £ £n 'T p), and 9 G M'^, we have 

= exp[(Vlogg'„,0^(z),6» - 0*)-bl2n,6»(^)] , 

Qn.eA^) 


< exp 


r||0 — 0*||l + £n,e{z) 


Setting B{9) ^\\6 — 0*||i -|- p(||0*||i — ||0||i), it follows that for all 9 G 


(5.12) E(”) 


isAZ) 


Qn,6^Z) e 


,^gAZ) e-Pll®*lli 


Qn 


<eSWEW [l^jZ)exp(£„„,(Z))] 


We then write 


(5.13) 


< 


Ii-^11^-5^111+ ^11(0-0*) •5*lli- 


Using this bound in the expression of B{9) shows that if 0 ^ 0* + AA, then 
we have 

B{9) < -P-\\9.6lh + ^-\\{9-9AAAi 

< -^||0-0*||i. 


(5.14) 
This bound, together with the fact that the expectation on the right-side of
(5.12) is always smaller than or equal to 1 (which follows from the concavity
assumption), shows that when $\theta\notin\theta_\star+\mathcal{N}$,




^sAz) 


qn,e{Z) e-p\W\^ 
q-nfiAZ) 


< e"4l 


Now, consider the case where AA / 0, and 0 — 0* € M. In that case, the 
definition of the set , r) and (5.12) yield 




UAZ) 


qn,e{Z) 

qnfiAZ) 


< (,B{9)-\r{\\e-0,\\2) _ 


From (5.14), 


7* 2 j 


B{e) - ir(||0 - 0 *|| 2 ) < -^||0 - 0*||i + 2p||(0 - 0*) • 5*||i - ir(|K 

and 

2 p ||(0 - 0 *) • ( 5 *||i - ir (||0 - 0*112) < 2 p ^||0 - 0*||2 - ^r (||0 - 0*112), 

< KII 6 '-6**112)- 4 pV^|| 0 -0*112] 


<-inf 

2 3:>0 


r{x) — 

Therefore, when 0 7 ^ 0*, and 6 £ 6^ + A 6 , we have 




^eAz) 


qn,e{Z) 

qn,eAZ) e-HI^*lli 




rix) — ApsA'^x 
0. In view of these calculations and 


where a = — ^ inf 2 ;>o 


. Note that a > 0, since limj,|o r(x)/x = 
]5.11), we conclude that 


T < e" 





Pd,5{^0)- 


Note that Pd,s{-^k) = 0 if ||5||o < s* + fc, and 





Ad,sid9) < 


^ p\j IFIIo 



4ll<5||o^ 
Therefore,


Using H3, 


T < e" 



— y 7r,4ll^llU 

TT;? 

* <5: ||5||o>s*+fc 


— E 

* <5: ||<5||o>^*+fc 


7r54ll'^llo = El y y 4i 


/ d\ d 

El v 4y^v^* 


5s* 


J = S*-|-fc 


5s* 

d 

s* 


j=s*-|-fc 


Vdc4; 


5s* 




4-. 

j = S*-|-fc 




For d large enough so that < 1, we have Yfj=s^,+k {^Y ^ 

which proves the stated bound. 


□ 


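5.1.4. Proof of Theorem 3-Part(2). Define $U(\epsilon) \overset{\mathrm{def}}{=} \{\theta\in\theta_\star+\bar\Theta:\ \|\theta-\theta_\star\|_2 >
M_0\epsilon\}$. We apply Lemma 14 with the same $\lambda$, the same split cone $\bar\Theta$, the rate function $r$, and with
$M = M_0 > 2$. Notice that $\phi_r(2\lambda)$ is the quantity called $\epsilon$ in Lemma 14. By Lemma 14
there exists a measurable function $\phi:\mathcal{Z}^{(n)}\to[0,1]$ such that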

(5.15) [f){Z)] < y 

i>i 

where Dj D 6^(0, (j-|-l)Adoe)^. Using the test function (f>, we 

have 

Iln,d {U{e)\z) < 0(Z) + (1 - </.(Z))n„,d {U{e)\z ). 

In view of (5.15), it remains only to control the expectation of {l—(f{Z))tin,d {U{e)\Z). 

To do so, we set Bn Tn,o(0) ^) Ll Tn,i(0) i"), so that Bn T Bn^ Tn,i(0) A), 
and use Lemma 11 and Fubini’s theorem to write 


(5.16) 

eW [{l-ct>{Z))tin,d{U{t)\Z)\ =eW 


{l-cf{Z)) 


fm szTOnii") 


I 


Qn.ejZ) 

<?n,e*(^) 


n(d5) 


< (z ^ T„) + — (1 + ^ 


-^5. 

eW 


2 \ 


L 




'U(e) 


lgjZ)il-cfiZ)) 


qn,e{Z) 

qnfiAZ) 


n(d5). 
We split $U(\epsilon)$ as $U(\epsilon) = \cup_{j\ge1}B(j)$, where

$$B(j) = \{\theta\in\theta_\star+\bar\Theta \text{ s.t. } jM_0\epsilon < \|\theta-\theta_\star\|_2 \le (1+j)M_0\epsilon\}.$$

Therefore, and using the notation of Lemma 14, the integral in (5.16) is




’Uiil) 




y] / / i'i-- Hz))Qn,e{z)dz 

4B(7) Uz(^) 


n(d0) 


From the prior 11, we have 


n(d0)< j;e-|K^)n(B(j)). 

1>1 


ePll^*llin(B(j)) = 

<56Ad 

and for 6 G B{j), 


p\ I|0||0 
2. 


B(i) 


e^(ll'^*lli-ll^lli)/id,5(d0). 


p(|| 0 *||i - || 0 ||i) < p\\e - 0*||i < -|||0 - 0*||i + ^p\\e - 0*||i 

< -|||6» - 6i*||i + ^pco \\d - 0*112 < -|||0 - 0*||i + SpcojMoe 
where cq = sup^ggSup^gQ m^h | (sign(u),u) |. Hence 


ePll'?dlin(B(j)) < TT^ ( I )' 




-5eAd 


'B(i) 


p\ Iloilo 
2 . 


-fl^ld; 


< TTs 

SeAd 
— gSpcoiMoe ^ 

SeAa 

Therefore, the second term on the right-hand side of (5.16) is upper bounded 
by 


2\ s* 


\<5GAd / ^ ^ k>l 

As in Part(l), using H3 and for > 4c2, 


g-1 r( ) gSpco fcMoe 


TT; 


;■ E 




-56Ad 


9si, 


j=0 


9s^ 


3=0 


J 


s*/ 5 s* V'S*/ V Cl y 
This ends the proof.

□ 


5.2. Proof of Theorem 4. Clearly, B1 implies H1, and H2 trivially holds
true. Furthermore in this case the function $\mathcal{L}_{n,\theta}$ is given by

$$\mathcal{L}_{n,\theta}(z) = -\sum_{i=1}^n\left[g(\langle x_i,\theta\rangle) - g(\langle x_i,\theta_\star\rangle) - g'(\langle x_i,\theta_\star\rangle)\langle x_i,\theta-\theta_\star\rangle\right],$$

which does not depend on $z$. To control this term, we will rely on a nice
self-concordance property of the logistic function $g(x) = \log(1+e^x)$ developed
by [7] Lemma 1, which states that for all $x_0, u\in\mathbb{R}$,

(5.17) $$g^{(2)}(x_0)\left(e^{-|u|}+|u|-1\right) \le g(x_0+u)-g(x_0)-g'(x_0)u \le g^{(2)}(x_0)\left(e^{|u|}-|u|-1\right).$$

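Proof of Part(1). We shall apply Theorem 3-(Part 1). Clearly, $\theta\mapsto
\log q_{n,\theta}(z)$ is concave for all $z\in\{0,1\}^n$. We define $H(x) \overset{\mathrm{def}}{=} e^{-x}+x-1$. It
can be checked that $H$ satisfies

(5.18) $$H(x) \ge \frac{x^2}{2+x},\qquad x\ge0.$$

This holds because $(2+x)H(x) - x^2 = (2+x)e^{-x}+x-2$, which vanishes at $x = 0$ and the derivative of
which is $1-(1+x)e^{-x}\ge0$, for all $x\ge0$. Using (5.17), we get

$$\mathcal{L}_{n,\theta}(z) \le -\sum_{i=1}^n g^{(2)}(\langle x_i,\theta_\star\rangle)\,H\left(|\langle x_i,\theta-\theta_\star\rangle|\right).$$

Furthermore, for $\theta-\theta_\star\in\mathcal{N}$, we have

$$|\langle x_i,\theta-\theta_\star\rangle| \le \|X\|_\infty\|\theta-\theta_\star\|_1 \le 8\|X\|_\infty s_\star^{1/2}\|\theta-\theta_\star\|_2.$$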
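The two elementary inequalities used in this step, the self-concordance bound (5.17) for $g(x) = \log(1+e^x)$ and the lower bound (5.18) for $H(x) = e^{-x}+x-1$, can be sanity-checked numerically. The sketch below is an illustration only; the function names, the grids, and the small tolerance `tol` (which absorbs floating-point rounding) are assumptions of this example.

```python
import math

def g(x):
    """Logistic log-partition function log(1 + e^x)."""
    return math.log1p(math.exp(x))

def g1(x):
    """First derivative of g (the logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))

def g2(x):
    """Second derivative of g."""
    s = g1(x)
    return s * (1.0 - s)

def H(x):
    return math.exp(-x) + x - 1.0

def check_517(x0, u, tol=1e-12):
    """Two-sided bound (5.17) on g(x0 + u) - g(x0) - g'(x0) u."""
    middle = g(x0 + u) - g(x0) - g1(x0) * u
    lower = g2(x0) * H(abs(u))
    upper = g2(x0) * (math.exp(abs(u)) - abs(u) - 1.0)
    return lower - tol <= middle <= upper + tol

def check_518(x, tol=1e-12):
    """Lower bound (5.18): H(x) >= x^2 / (2 + x) for x >= 0."""
    return H(x) >= x * x / (2.0 + x) - tol

x0_grid = [float(k) for k in range(-5, 6)]
u_grid = [0.5 * k for k in range(-10, 11)]
ok_517 = all(check_517(x0, u) for x0 in x0_grid for u in u_grid)
ok_518 = all(check_518(0.25 * k) for k in range(81))
```

Combining the lower bound in (5.17) with (5.18) is what produces the denominator $2 + |\langle x_i,\theta-\theta_\star\rangle|$ in the curvature bound that follows.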


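Using this, (5.18), and the definition of $\kappa_1$, we get for all $z\in\{0,1\}^n$,

(5.19) $$\mathcal{L}_{n,\theta}(z) \le -\frac{n}{2+\max_i|\langle x_i,\theta-\theta_\star\rangle|}\,(\theta-\theta_\star)'\left(\frac{X'WX}{n}\right)(\theta-\theta_\star) \le -\frac{n\kappa_1\|\theta-\theta_\star\|_2^2}{2+8s_\star^{1/2}\|X\|_\infty\|\theta-\theta_\star\|_2} = -\frac12 r(\|\theta-\theta_\star\|_2),$$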
where $r(x) = n\kappa_1x^2/(1+4s_\star^{1/2}\|X\|_\infty x)$. Hence, with this particular choice
of rate function $\mathbb{P}^{(n)}\left(Z\notin\check{\mathcal{E}}_{n,1}(\mathcal{N},r)\right) = 0$. Since $g^{(2)} \le 1/4$, it follows that


8 n 


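As a result, if $\theta-\theta_\star\in\Theta_\star$, $\mathcal{L}_{n,\theta}(z) \ge -(n/8)\bar\kappa_1(s_\star)\|\theta-\theta_\star\|_2^2$. Hence
$\mathbb{P}^{(n)}\left(Z\notin\hat{\mathcal{E}}_{n,1}(\Theta_\star,L)\right) = 0$, for $L = n\bar\kappa_1(s_\star)/4$. Finally, we have $\nabla\log q_{n,\theta_\star}(Z) =
\sum_{i=1}^n\left(Z_i - g'(\langle x_i,\theta_\star\rangle)\right)x_i$, and by Hoeffding's inequality, and a standard
union bound argument,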



< 


[ max 
\l<j<d 


EA 


g'{{xi,e^))) Xij 


2 exp 



d l 


2 

d’ 



given the choice of $\rho$ in (3.2) of the main paper. Hence we can apply Theorem
3-Part(1). This says that for any $k\ge0$,


eW ({0 


^llo Z s* -|- k'\\Z 
+ 2e~^ ( 4 


2 

< - 
- d 

Ki(s*) 


16||A||^log(d) 


d\ /4c2 


1 /2 

where a = (l/2)infa;>o r(x) — ApsJ x . It is not hard to verify that for 

- ~ 4 vTVr-cfe - -i' - (V3)&c. In the 


l+bx 


— CX 


T,b,c > 0, infa ;>0 

case of a, the condition r > (4/3)6c is satisfies if y/n > 64x (4/3)|| A||^s*y^log((i)/ 
and we have 

64|| Allies* log(d) 


a > —- 


"ii 


Using this and the combinatorial inequality it follows that 


H, 


i,d 




Uil 


e(^) 

-|-2exp 
Then for a > 0, choose 

/ X , 2a 2 / 

(5.20) k = — + — [1 + 

C4 C4 V 


9eR^: ||0||o >s* + fc}|z) 

Ki(s*) 


2 

< - 
- d 


IL 


log(4e) \ 
64||X||^log(d)2 ' log(d) y 




-|- A: log 


4c2 


64||X|| 




Kl[Si, 


ai 


log(4e) \ 

64||X||^(log(d))2 ' log{d)) 




s*, 






























to conclude that the second term on the right-hand side of the above in¬ 
equality is upper-bounded by provided that > 4c2. Setting a = 1 
proves the theorem. □ 


Proof of Part( 2). We apply Theorem 3-Part(2) with A = /9\/l with p 
as in (3.2) of the main paper, and s = C -|- s*, with as in Part (1). We also 
choose L = nKi(s*)/4, 0 = {0 G : \\0 — 0*||o < s}, the rate function 

rjx) = nKi{s)x^/{l + y/^\\X\\aox/2), and £n = ^n,o(0,A) n Tn,i(0*,T) n 
^n,i(0)i')- With similar calculations as in Part (1), it is easy to establish 
that 

If 0 ^ 0, then ||0||o > s — s* = ("j and by Part (1), we conclude that 


jeW 


n„,rf(R''\0|z) 



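Recall that $\phi_r(a) = \inf\{x > 0:\ r(z) - az \ge 0, \text{ for all } z\ge x\}$. Since
$r(x) = n\kappa_1(\bar s)x^2/(1+\bar s^{1/2}\|X\|_\infty x/2)$, if $n\kappa_1(\bar s) - \bar s^{1/2}\lambda\|X\|_\infty > 0$, then

$$\epsilon = \phi_r(2\lambda) = \frac{2\lambda}{n\kappa_1(\bar s) - \bar s^{1/2}\lambda\|X\|_\infty}.$$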
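The closed form for $\phi_r$ follows from a one-line computation that may be worth spelling out. For a rate function of the generic form $r(x) = cx^2/(1+bx)$ with $c, b > 0$ (here $c$ and $b$ are placeholder names standing for $n\kappa_1(\bar s)$ and $\bar s^{1/2}\|X\|_\infty/2$), we have $r(x) - ax = x\,[(c-ab)x - a]/(1+bx)$, which is non-negative for all large $x$ exactly when $c > ab$, in which case $\phi_r(a) = a/(c-ab)$. The following sketch, with hypothetical function names and example values, encodes this.

```python
import math

def rate(x, c, b):
    """Rate function r(x) = c x^2 / (1 + b x), with c > 0 and b > 0."""
    return c * x * x / (1.0 + b * x)

def phi_r(a, c, b):
    """inf{x > 0 : r(z) - a z >= 0 for all z >= x} for the rate above.

    The set is empty when c <= a * b, in which case the infimum is +infinity.
    """
    if c <= a * b:
        return math.inf
    return a / (c - a * b)

# Hypothetical values: c plays the role of n * kappa_1, a the role of 2 * lambda.
c, b, a = 100.0, 0.5, 4.0
threshold = phi_r(a, c, b)  # equals 4 / 98
```

Taking $a = 2\lambda$ and substituting the expressions for $c$ and $b$ gives $ab = \bar s^{1/2}\lambda\|X\|_\infty$, which recovers the displayed formula for $\epsilon$.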

Then we take $n$ large enough so that $(3/4)n\kappa_1(\bar s) \ge \bar s^{1/2}\lambda\|X\|_\infty$, to conclude
that


_ A ^ - 16|l^lloo I slogjd) 

nKi(s) — s^/2A||X||oo ~ nKi{s) !li{s) V n °° 

The condition $(3/4)n\kappa_1(\bar s) \ge \bar s^{1/2}\lambda\|X\|_\infty$ translates into the sample size
condition $\sqrt{n} \ge (16/3)\|X\|_\infty^2(\bar s/\kappa_1(\bar s))\sqrt{\log(d)}$, which holds by assumption.
We fix $M_0 \ge \max(500, 1+(c_3+c_4/2)/8)$, and apply Theorem 3 to get:


(5.21) 

eW 


n 


n,d 


G : \\9 - 0*11 > Moe} 


< 


i>i 


e 8 2 ' 


+ 2 





i>i 


Since $\phi_r(a)$ is defined as $\inf\{x > 0:\ r(z) \ge az, \text{ for all } z\ge x\}$, and
$jM_0\epsilon/2 > \epsilon = \phi_r(2\lambda)$, we have $r(jM_0\epsilon/2) \ge 2\lambda(jM_0\epsilon/2) = \rho\bar s^{1/2}jM_0\epsilon$. Hence


(5.22) ^ 


e 



i>i 


i>i 


g-|Mo\/spe 
I _ g-|AfoVipe 
where the last inequality follows from the bounds


O O 


A 


nKi(s) 


sWXP 

= 2Mq -Tr^log(d) > 8 Moslog(d) > 1 

lii(s) 


since SMqs > I 6 M 0 /C 4 > 1, and log(d) > 1, by assumption. Using the 
arguments in Example 7.1 of [19] shows that the packing numbers Dj satisfies 
supj>i Dj < (^)(24)* < It follows that 


E". 

i>i 




e 8 


< 2 exp 

2 




< 


d’ 


provided that log(d) > 1, and using the condition SMq > C4/2 + l + log(24e) 
Setting X = jMoe/2, we have 


„ - 1 .jMoe x( nK^{s)x 

3pA.Mo.-^r(—) < 

nKi(s)^^ 


< 


(5.23) 

provided that 


< - 


riKiis)^ 




^ \ 1 + ^\/s||-^||oo— 2 “ 
2pyfix 


- — 48p\/s > 2py/s. 


48pVs^ , 

■ — 48p-\/I j 


This latter condition holds for all Mq > 500, if ^/n > 125s||X||^ Y^log(d)/K;^(s). 
In which case, from (5.23) we have 


E^ 

i>i 


3ps-l/2jMo.- 


- (a) 


g-|jAfox/7pe 

i>i 


where the inequality (a) uses (5.22). In conclusion, the last term on the 
right-hand side of (5.21) is upper-bounded by 


(5.24) 




Cl J 


5* 


1^ 1 + ^ g-8Moslog(d) < 4 




S 4 exp 


Ki(s*) 


^ f-I , , log(c/ci) 

».log(rf)(^l + C3 + ^^ 


64||X|||,log(d)^ 


— 8 Moslog(d) 
Given that $\bar s = s_\star+\zeta$ with $\zeta$ as in Part (1), since $\log(d) \ge \log(e/c_1)$, and
$8M_0 \ge 2+c_3$, we see that the right-side of (5.24) is upper-bounded by
$4/d$. The theorem follows. □


5.3. Proof of Theorems 9 and 10. It is obvious that H1 and H2 hold for
this example. For convenience in the notation, for $z\in\mathbb{R}^{n\times p}$, $1\le j\le p$, we
let $z^{(j)}\in\mathbb{R}^{n\times p}$ be the matrix obtained by replacing all the components of
the $j$-th column of $z$ by 1. We introduce


/ \ 5 

i=i 1 -L exp \ uj + Y.k^j UkZik) 

and = log q^l{z)-log . {z)-(y log . (z),u - , u G 

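The function $u\mapsto q^{(j)}_{n,u}(z)$ is the likelihood function of the logistic
regression model of the $j$-th column of $z$ on $z^{(j)}$. Let $\mathcal{H}^{(j)}_n(z) \overset{\mathrm{def}}{=} -\nabla^{(2)}\log q^{(j)}_{n,\theta_{\star\cdot j}}(z)$.

Specifically, we have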


i=l 



-L E Zik 9irkj 


^is '^it ’ 


1 < S,t < p. 


We will need the following restricted smallest eigenvalues of 'Hn\z). 


\z) = inf 


u'{'Hn\z))u 


n\\u\\ 


uGM?>\{ 0}, |ufc|<7 

^*,kj 


E 

^*,kj — 1 


\uk\- 


inf 


u'{nli\z))u 

nlluiP 


uGMP\{0}, 



^2(^) 


inf K 2 \z), and K 2 {s,z) 
i<i<p 


inf K 2 \s,z). 
i<j<p 


The next result shows that if $\kappa_2(s) > 0$ and $\kappa_2 > 0$ (with $\kappa_2(s)$ and $\kappa_2$
as defined in (4.4) of the main paper), then with high probability $\kappa_2(Z) > 0$
and $\kappa_2(s,Z) > 0$. The proof is an easy modification of the argument of
[4] Lemma 2.5. We omit the details.


Lemma 15. Assume C1. There exist finite universal constants $a_1, a_2$
such that the following two statements hold true.
1. For 1 < s <p, i/K 2 (s) > 0, and n> ai j log(p), then

(^K2is,Z) < 

If H 2 > 0 , and n > ai log{p), then 


p(n) 





Proof of Theorem 9-Part(1). We will reduce the result to Theorem 
4 Part(l). We set 

{z G : K^\z) > kJ2}, and G = {z G R^^P ; ^ 2 ( 2 ) > ^ 2 / 2 } 

We also define = {u G : ||u||o < Cj}- Define 0 {9 G RP^p : 

ll^-jllo < Sj, I < j < p}. Hence if 0 ^ 0, then O.j ^ A^^\ for some j. 
Therefore, 




n 


n,d 


ndxd ' 


\l ^ 

\0|ZJ < (Z ^ g)+^E(' 


n) 




i=i 

Note that Q C G^^\ and {Z G G^^^ is measurable. Hence by condi¬ 

tioning on Z^^\ we get 




+ ^eW [lg(,)(Z)EW (M''\^('')|Zj |Z(^') 

i=i 


By conditioning on Z^^\ and for Z G G^^\ we are taken back to the setting 
of the standard logistic regression with a well-behaved design matrix. With 
the choice of Qj, and since p in (4.5) of the main paper is taken larger than 
^■sjn log(p), by Theorem 4 (1), there exists an absolute constant Ai such 
that for p^^* > 8 c 2 max(l, 2 C 2 ), and n > Ai(s*/k 2 )^ log(p), we have 







The term $p^2$ in $4/p^2$ comes from using $\alpha = 2$ in (5.20). Without any loss
of generality we can take $A_1$ as large as the constant $a_1$ in Lemma 15 to
conclude that $\mathbb{P}^{(n)}(Z\notin\mathcal{G}) \le e^{-a_2n}$. Hence


E(0 


fln,d ( m ''^''\ 0 | z ) 




4 

H-! 

p 


as claimed. 


□ 
Proof of Theorem 9-Part(2). We shall apply Theorem 3-Part(2), with
the split cone

{6 We.jWo <Sj, l<j<p}. 

Here the norm || • ||2 in Theorem 3 is the Frobenius norm ll'Hp, whereas the 
notation || • ||2 in what follows will denote the Euclidean norm on Notice 
that if 0 ^ 0* -|- 0, then 0^0 (where 0 is as defined in Theorem 9). Hence 
we will use Theorem 9 to control the term \ {O* + 0)1-2^)) • 

More precisely, there exist universal positive constants Ai, A 2 such that for 
p'^'^ > 8 c 2 max(l, 2 C 2 ), and n > Hi(s*/k 2 )^ log(p), 

(5.25) E(") \ (0* + 0)1^)) < e"^ 2 n ^ 

Set S ^ — P '^1 = ?T'S*/4, r(x) = nK{s)x^/ (2 -|- S^^'^x), and 

consider Sn = Snfi{Q, A) n £l„^i(0*, L) n f„^i(0, r). We have 


sup |(Vlogg„, 0 ^(Z),u)p| < 


A '^Sj\\Vlogqn,eAZ)\\c 

\j=l 


Using this and a standard Hoeffding inequality, we obtain that 

(5.26) 


{Z ^ £'„,o( 0,A)) < 2exp ( 21og(p) - 


1 

2 n 


A 




2 
“ 5 

p 


< , 


given the choice of A, and p in (4.5). 

We use a second order Taylor expansion of u 1 —^ qn}u{^) around and 
the fact that g^'^\x) < 1/4 to deduce that for all 0 G 0* -|- 0* 


i=i 




ns* 


n 




Hence with L = ns*/4, 

(5.27) pW(Z^4i(0*,^)) = 0. 

Consider the set 
Take $Z\in\mathcal{G}$. Then for all $j$, $\kappa_2^{(j)}(\bar s,Z) > \kappa(\bar s)/2$ and we can use the same
argument as in (5.19) to conclude that for $\theta-\theta_\star\in\bar\Theta$,

Aj) (y^ ^ nK{s)\\e.j - e^.j\\l/2 

2 + ^\\0.j-d^.,\\2' 

It follows that for 9 — 9^, £ 0, 


^n,e{Z) < 


nK{s)\\9.j - 9^.j\\l/2 ^ 1 nK{s) \\9 - 9^\\l 

2 + ^\\9.j - 9^.j\\2 - 2 2 + 5V2 ||0 _ 0^11^ 



Hence, with the rate function r(x) = nK{s)x^/{2 + we have 

(5.28) {Z i 4,1 (0, r)) < {Z iQ)< 

as seen in Lemma 15, provided that n'> Ai log(p) (without any loss 

of generality, we take A\ greater than the constant ai in Lemma 15). Hence, 
with £n = £n,o(&,^) n4,i(0*4) 04 , 1 ( 0 +)) it follows from (5.26)-(5.28) 

that for n> Ai log(p) 

(5.29) P(^)(Z^T„) <e-“2” + -. 

P 

Finally, we note that with the same calculations as in the proof of Theorem 
4 Part(2), we can choose the constant ai such that for n> Ai log(p). 


2A 

nK{s) 


< e = </>r(2A) < 


4A 

nK{s) 


< 


k{s) 


96 Slog{p) 


< 00 . 


n 


We are then ready to apply Theorem 3-Part(2). Fix Mq > max(500,1 + 
(c 3 + C4/2)/8), set V {9 G : ||0 —0*||f > Mqe}, then for n > 

Ai log(p), (2.9), (5.25), and (5.29) give 


(5.30) 

eW [UnAv\z)] < 



+ 



+ D,e-iK^) 

l>i 



^g-|r(L^)g3pcofcAfof 

k>l 
Similar calculations as in the proof of Theorem 4 Part(2) show that


and < 2 e-i 6 AfoSiog(p)^ 

i>i ^ i>i 


and 




,C3 \ 


L 

14-2 


< 2 ?* exp I I ^ s^j 1 log(p) f 1 -L C3 -L + 


0=1 


log(p) 4(242) log(p )2 


Hence, and by the same argument as in the proof of Theorem 4 Part(2), the 
last term on the right-side of (5.30) is bounded by 4/p. □ 


Proof of Theorem 10. We will reduce this result to Theorem 4 Part(2). 
We set 

V {0 G : \\ 0 .j \\2 > ej, for some j}, 

and V 0 n V, where 0 = {0 G : ||0.j||o < Sj, 1 < j < p}. Using 
Theorem 9 as we did in (5.25), there exist universal positive constants Ai,A 2 
such that for > 8 c 2 max(l, 2 c 2 ), and n > ^i(s*/k 2 )^ log(p). 


[nno(v|z)] < e(”) 


n. 


nA\ 


X)dxd 


\0|z)l +eW [finAv\z)] , 


< e-^^^ + -+E^^^[Un,d{V\Z)]. 


We define 


{z G : K^\z) > ^ 2 / 2 }, and G 

We also define {u G Mi* : ||u||o < Sj, 

0 G V, then O.j G A^^\ for some j. Therefore, 


= {zem^^P: K2 {z)>K2/2}. 
and ||u ||2 > ej}. Hence, if 


E(”) [n„,rf(V|Z)] < p(") (Z i g)+^EW [lgo)(Z)E(") 


i=i 


H 


n,d,j 


z) |z(l) 


Fix Mq > max(125,C4(l -|- C3)/64). E^") [^n,d,j (-41 ^-^^jZ)] is the same as the 
posterior distribution of the logistic regression of the j-th column of Z on 
and for G we can apply Theorem 4 Part(2). Hence, we can
take Ai large enough so that forp > e(l + ci)/ci, and n > Ai{s/^ 2 )^ \ogij)), 



Hence 



□ 